Background:
Clang currently supports developer-controlled prefetching through __builtin_prefetch(). However, Clang lowers these calls directly to intrinsic calls, which can interfere with mid-end optimizations, for example by preventing vectorization. In addition, the prefetch distance of a __builtin_prefetch() call is fixed in the source. We propose a prefetch pragma to address these shortcomings. In our design, the prefetch distance can be adjusted according to the vectorization factor (VF), and the hint does not interfere with other passes. For example:
// prefetch method 2:
// #pragma clang loop prefetch(a, 1, 16)
for (int i = 0; i < n; ++i) {
  b[i] = a[i];
  // Assume this prefetch distance is just what the program needs.
  // prefetch method 1:
  __builtin_prefetch(&a[i+16], 0, 3);
}
- Loop with prefetch method 1 after loop vectorization (opt -passes=loop-vectorize ...):
    ...
    for.body:                                         ; preds = %for.body.preheader, %for.body
      %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
      %3 = add nuw nsw i64 %indvars.iv, 16
      %arrayidx4 = getelementptr inbounds i32, i32* %a, i64 %3
      %4 = bitcast i32* %arrayidx4 to i8*
      tail call void @llvm.prefetch.p0i8(i8* nonnull %4, i32 0, i32 3, i32 1)
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
      %exitcond.not = icmp eq i64 %indvars.iv.next, %0
      br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
    }
    ...
  The intrinsic call (@llvm.prefetch.p0()) prevents loop vectorization.
- Loop with prefetch method 2 after loop vectorization and loop data prefetch (opt -passes="loop-vectorize,loop-data-prefetch" ...):
    ...
    vector.body:                                      ; preds = %vector.body, %vector.ph
      %indvars.iv = phi i64 [ %indvars.iv.next, %vector.body ], [ 0, %vector.ph ]
      %indvar2 = phi i64 [ %indvar.next3, %vector.body ], [ 0, %vector.ph ]
      %4 = shl nuw nsw i64 %indvar2, 5
      %5 = add nuw nsw i64 %4, 512
      %uglygep4 = getelementptr i8, ptr %a, i64 %5
      %6 = getelementptr inbounds i32, ptr %a, i64 %indvars.iv
      tail call void @llvm.prefetch.p0(ptr %uglygep4, i32 0, i32 3, i32 1)
      %wide.load = load <4 x i32>, ptr %6, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %7 = getelementptr inbounds i32, ptr %6, i64 4
      %wide.load1 = load <4 x i32>, ptr %7, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %8 = getelementptr inbounds i32, ptr %b, i64 %indvars.iv
      store <4 x i32> %wide.load, ptr %8, align 4, !tbaa !6
      %9 = getelementptr inbounds i32, ptr %8, i64 4
      store <4 x i32> %wide.load1, ptr %9, align 4, !tbaa !6
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
      %10 = icmp eq i64 %indvars.iv.next, %3
      %indvar.next3 = add nuw nsw i64 %indvar2, 1
      br i1 %10, label %scalar.ph, label %vector.body, !llvm.loop !11
    ...
    !10 = distinct !{i1 true, i32 1, i32 16}
With prefetch method 2, the loop is vectorized, and the prefetch distance is scaled from 16 elements to 128 elements (the 512-byte offset in the IR) according to the VF chosen by the loop vectorizer.
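The step from 16 to 128 can be checked by hand against the vectorized IR. The sketch below is only a worked arithmetic example, not code from the patches: the helper names are hypothetical, and the scaling rule (requested iterations times elements consumed per vectorized iteration) is an assumption inferred from the IR shown, where each pass through vector.body handles two <4 x i32> loads, i.e. 8 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many elements ahead the prefetch lands once the
 * loop is vectorized. Each vectorized iteration consumes vf * interleave
 * elements, so a distance given in iterations is scaled by that amount. */
static size_t scaled_distance_elems(size_t distance_iters, size_t vf,
                                    size_t interleave) {
    assert(distance_iters > 0 && vf > 0 && interleave > 0);
    return distance_iters * vf * interleave;
}

/* Same distance expressed as a byte offset from the current address. */
static size_t scaled_distance_bytes(size_t distance_iters, size_t vf,
                                    size_t interleave, size_t elem_size) {
    return scaled_distance_elems(distance_iters, vf, interleave) * elem_size;
}
```

With a requested distance of 16, VF = 4, an interleave count of 2, and 4-byte elements, this gives 128 elements and 512 bytes, which agrees with the constant in "%5 = add nuw nsw i64 %4, 512".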
We would like to offer developers a pragma-based way to control data prefetching. In our design, the hint supplied by the pragma is converted into metadata, so it has little impact on other optimization passes. In particular, it does not block vectorization, and the prefetch distance can be adjusted to match the iterations of the transformed loop. Pragma-based control is also more convenient when migrating projects.
We hope the pragma and the builtin can be used together to give developers better control over data prefetching.
Proposed Semantics:
- #pragma clang loop noprefetch(variable)
  - variable: a memory reference (the data that should not be prefetched); must be a declared pointer/array variable.
- #pragma clang loop prefetch(variable, level, distance)
  - variable: a memory reference (the data to be prefetched); must be a declared pointer/array variable.
  - level: an optional value that tells the compiler which kind of prefetch to issue. To use this argument, you must also specify variable.
    ‘0’: data will not be reused
    ‘1’: L1 cache
    ‘2’: L2 cache
    ‘3’: L3 cache
  - distance: an optional integer argument with a value greater than 0. It gives the number of loop iterations by which the prefetch is issued ahead of the corresponding load or store instruction. To use this argument, you must also specify variable and level.
Rules of pragma:
- Support for prefetching multiple variables:
#pragma clang loop prefetch(a)
#pragma clang loop prefetch(b)
#pragma clang loop noprefetch(c)
for (int i = 0; i < n; ++i) {
  // prefetch a and b, noprefetch c.
  res = a[i] + b[i] + c[i] + d[i];
}
- Support for nested loops:
#pragma clang loop prefetch(a)
for (int i = 0; i < n; ++i) {
  res += a[i];
  #pragma clang loop prefetch(b)
  for (int j = 0; j < n; ++j) {
    res += b[j];
  }
}
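For comparison, this is roughly what a developer has to write by hand today to get the effect requested by the first rule (prefetch a and b, leave c alone). It is a sketch using the existing __builtin_prefetch() builtin, not output of the proposed pragma; the function name sum4, the distance of 16 iterations, and the locality argument 3 are assumptions chosen for illustration.

```c
#include <stddef.h>

#define PREFETCH_DIST 16 /* assumed distance, in loop iterations */

/* Sum four arrays element-wise, manually prefetching a and b only.
 * c and d are left to the hardware prefetcher. */
static long sum4(const int *a, const int *b, const int *c, const int *d,
                 size_t n) {
    long res = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Guard so the prefetched address stays inside the arrays. */
        if (i + PREFETCH_DIST < n) {
            /* Arguments: address, rw (0 = read), locality (3 = high). */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
            __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 3);
        }
        res += a[i] + b[i] + c[i] + d[i];
    }
    return res;
}
```

Explicit calls like these in the loop body are what the background section describes as an obstacle to vectorization, which is the motivation for carrying the same hint as metadata instead.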
Metadata implementation:
!llvm.loop.prefetch !0
!0 = distinct !{i1 true/false, i32 level, i32 distance}
1st arg: true: prefetch; false: noprefetch
2nd arg: -1: default; 0-3: cache level
3rd arg: -1: default; 1-INT_MAX: iterations ahead
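To make the encoding concrete, the sketch below models the three metadata operands as a plain C struct and checks the value ranges listed above. It is an illustration only: the struct, macro, and function names are hypothetical and do not correspond to code in the patches, and the choice to store -1 for both operands of a noprefetch hint is an assumption.

```c
#include <stdbool.h>

#define PREFETCH_DEFAULT (-1) /* operand omitted in the pragma */

/* Mirrors !{i1 enable, i32 level, i32 distance}. */
struct prefetch_hint {
    bool enable;  /* true: prefetch, false: noprefetch */
    int level;    /* -1 (default) or 0..3 */
    int distance; /* -1 (default) or 1..INT_MAX iterations ahead */
};

/* Hint for "#pragma clang loop prefetch(var[, level[, distance]])". */
static struct prefetch_hint make_prefetch_hint(int level, int distance) {
    struct prefetch_hint h = { true, level, distance };
    return h;
}

/* Hint for "#pragma clang loop noprefetch(var)" (assumed to use defaults). */
static struct prefetch_hint make_noprefetch_hint(void) {
    struct prefetch_hint h = { false, PREFETCH_DEFAULT, PREFETCH_DEFAULT };
    return h;
}

/* Range checks from the operand description, plus the pragma rule that
 * distance may only be given together with level. */
static bool prefetch_hint_is_valid(const struct prefetch_hint *h) {
    bool level_ok = h->level == PREFETCH_DEFAULT ||
                    (h->level >= 0 && h->level <= 3);
    bool distance_ok = h->distance == PREFETCH_DEFAULT || h->distance >= 1;
    bool order_ok = h->distance == PREFETCH_DEFAULT ||
                    h->level != PREFETCH_DEFAULT;
    return level_ok && distance_ok && order_ok;
}
```

Under these assumptions, prefetch(a, 1, 16) corresponds to make_prefetch_hint(1, 16), i.e. the operands {true, 1, 16}, while a level of 4 or a distance of 0 would be rejected.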
An implementation is available in revisions D144377 and D144378.