[RFC] Adding support #pragma clang loop [no]prefetch() for prefetch


Now, clang implements developer controled prefetching by supporting __builtin_prefetch(). But we found that the clang can directly convert function calls into intrinsics which will have impact on some mid-end optimizations, such as preventing vectorization. In addition, the prefetch distance of the __builtin_prefetch() method is fixed. Here we propose a prefetch pragma to address the shortcomings of the __builtin_prefetch(). In our pragma design, the prefetch distance can be dynamically adjusted according to the value of the VF and it won’t have any side effect on other passes. For example:

// prefetch method 2:
// #pragma clang loop prefetch(a, 1, 16)
for (int i = 0; i < n; ++i) {
  b[i] = a[i];
  // Assume this prefetch distance is just what the program needs.
  // prefetch method 1:
  __builtin_prefetch(&a[i+16], 0, 3);
  1. Loop with prefetch method 1 after loop vectorization (opt -passes=loop-vectorize ...):

    for.body:                                         ; preds = %for.body.preheader, %for.body
      %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
      %3 = add nuw nsw i64 %indvars.iv, 16
      %arrayidx4 = getelementptr inbounds i32, i32* %a, i64 %3
      %4 = bitcast i32* %arrayidx4 to i8*
      tail call void @llvm.prefetch.p0i8(i8* nonnull %4, i32 0, i32 3, i32 1)
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
      %exitcond.not = icmp eq i64 %indvars.iv.next, %0
      br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10

    The intrinsic call(@llvm.prefetch.p0()) prevent loop vectorization.

  2. Loop with prefetch method 2 after loop vectorization and loop data prefetch (opt -passes="loop-vectorize,loop-data-prefetch" ...)

    vector.body:                                      ; preds = %vector.body, %vector.ph
      %indvars.iv = phi i64 [ %indvars.iv.next, %vector.body ], [ 0, %vector.ph ]
      %indvar2 = phi i64 [ %indvar.next3, %vector.body ], [ 0, %vector.ph ]
      %4 = shl nuw nsw i64 %indvar2, 5
      %5 = add nuw nsw i64 %4, 512
      %uglygep4 = getelementptr i8, ptr %a, i64 %5
      %6 = getelementptr inbounds i32, ptr %a, i64 %indvars.iv
      tail call void @llvm.prefetch.p0(ptr %uglygep4, i32 0, i32 3, i32 1)
      %wide.load = load <4 x i32>, ptr %6, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %7 = getelementptr inbounds i32, ptr %6, i64 4
      %wide.load1 = load <4 x i32>, ptr %7, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %8 = getelementptr inbounds i32, ptr %b, i64 %indvars.iv
      store <4 x i32> %wide.load, ptr %8, align 4, !tbaa !6
      %9 = getelementptr inbounds i32, ptr %8, i64 4
      store <4 x i32> %wide.load1, ptr %9, align 4, !tbaa !6
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
      %10 = icmp eq i64 %indvars.iv.next, %3
      %indvar.next3 = add nuw nsw i64 %indvar2, 1
      br i1 %10, label %scalar.ph, label %vector.body, !llvm.loop !11
    !10 = distinct !{i1 true, i32 1, i32 16}

    With prefetch method 2, the loop can be vectorized, and the prefetch distance is adjusted from 16 to 128 according to the VF of loop vectorization.

I hope we can provide a pragma based data prefetch control method for developers. In the way we conceived, the hint information provided by pragma is converted into metadata, so that it can be used with little impact on other optimization passes. This will not have side effects on vectorization and the prefetch distance can be dynamically adjusted based on loop iterations. In addition, the prefetch control with pragma is more convenient for project migration.
I hope pragma and builtin based prefetcher can be used together to provide better control for data prefetching.

Proposed Semantics:

  1. #pragma clang loop noprefetch(variable)
    1. variable: a memory reference(data to be prefetched), must be a declared pointer/array variable.
  2. #pragma clang loop prefetch(variable, level, distance)
    1. variable: a memory reference(data to be prefetched), must be a declared pointer/array variable.
    2. level: an optional value to the compiler to specify the type of prefetch. To use this argument, you must also specify variable.
      ‘0’: data will not be reused
      ‘1’: L1 cache
      ‘2’: L2 cache
      ‘3’: L3 cache
    3. distance: an optional integer argument with a value greater than 0. It indicates the number of loop iterations ahead of which a prefetch is issued before the corresponding load or store instruction. To use this argument, you must also specify variable and level.

Rules of pragma:

  1. Support for prefetching mulitple data:
#pragma clang loop prefetch(a)
#pragma clang loop prefetch(b)
#pragma clang loop noprefetch(c)
for (int i = 0; i < n; ++i) {
   // prefetch a and b, noprefetch c.
   res = a[i] + b[i] + c[i] + d[i];
  1. Support nested loops:
#pragma clang loop prefetch(a)
for (int i = 0; i < n; ++i) {
   res += a[i];
   #pragma clang loop prefetch(b)
   for (int j = 0; j < n; ++j) {
     res += b[i]

Metadata implementation

!llvm.loop.prefetch !0
!0 = distinct !{i1 true/false, i32 level, i32 distance}

1st arg: true: prefetch; false: noprefetch
2nd arg: -1: default; 0-3: cache level
3rd arg: -1: default; 1-INT_MAX: iterations ahead

A revision with the implementation can be found in D144377, D144378

So my concerns are two fold; First, that this doesn’t model the IR well, and second, that it doesn’t model the builtin accurately.

I don’t particularly like trying to use metadata for something like this as it tends to be pretty ethereal, I would expect us to use a call to intrinsic on the variable like the builtin does. I realize this likely means modelling the pragma as attached to the loop, and having it be a property of the loop during IR-CodeGen (which I THINK we do elsewhere?). Though I DO have concerns on how this could be modeled for ‘iterations’.

Second the builtin uses a ‘read-write’ flag, and this doesn’t seem like it considers that, and seems to add ‘other’ things that aren’t particularly similar. So I guess I have a problem with calling it the same thing, if the pragma isn’t going to be the ‘same thing’.

I’m hopeful when Aaron comes along he can better respond to this, but I’m a little suspicious of pragmas like this (and frankly, builtins like this) in general, I find them very difficult to use correctly/usefully, and often punish users who don’t do a ‘perfect’ job silently.

FWIW, that’s been my experience as well. For the folks who know they exist and know how to use them properly, they’re nice enough, but for everyone else they’re a perf footgun because we’re assuming the programmer uses them properly. That doesn’t mean we shouldn’t have the option, but it does make me wonder what evidence there is of a user base for the functionality.

Another question I have is: do we need a pragma at all, or can we modify the existing builtin to have an overloaded form that does what we need?

I assume this will work with other loop constructs like range-based for loops in C++ and do/while loops? Or is this limited to simple for loops?

It’s probably best to discuss this under a different topic such as ‘IR & Optimizations’ as well, since clang frontend folks don’t always have a good handle on how IR changes like this will hold up through the various passes. The people that could offer specific feedback may not be reading this topic.

Yeah, what you said makes a lot of sense. This pragma was originally designed for experienced developers. The issue you mentioned is very valuable and I will consider it in depth. Thanks.

Currently, this pramga only works in a simple for/while loop. If we need to support range-based for loops, it can be extended.

1 Like

Perhaps teaching vectorizer to vectorize in presence of prefetch intrinsic would be better, if it’s feasible.

I am now implementing a vectorization of builtin_prefetch (prefetch intrinsic) and will register it with phabricator for review around the end of March 2023.

I am thinking that builtin_prefetch would be useful for prefetching indirect access.

I see some use of a loop pragma, in a scenario like this:

#pragma clang loop prefetch(<trip count ahead>, <locality>, <cache level>)
for (int i = 0; i < n; ++i) {
  f(a[i] + b[2*i]);

where the compiler would insert prefetch instructions for each of the accesses taking account each accesses’ stride and size. That is, a is prefetched sizeof(*a)*<trip count ahead> bytes ahead and sizeof(*b)*2*<trip count ahead> for b.

The other way is a builtin where the the user has to specify each pointer and the prefetch distance manually, such as:

  __builtin_prefetch(&a[i*<trip count ahead>]);
  __builtin_prefetch(&b[2*i*<trip count ahead>]);
  f(a[i] + b[2*i]);

I think it makes sense to have both. Maybe the compiler could derive <cache level> and <locality> argument automatically from the stride+cache sizes so there is less that the user could do wrong?

The proposal here seems to lie between those, where the user has to specify each variable separately in a pragma. For such cases I would also prefer using the builtin, since it is not too hard to compute the byte offset manually when you want more fine-grained control than just prefetch everything. The LoopVectorize pass should be able to handle prefetch intrinsics.

If using an intrinsic/metadata relative to a loop, it should reference the loop it is referring to. This is because instructions can be hoisted/sunk from/into loops or the loop unrolled.

for (int i = 0; i < n; ++i) {
  #pragma clang loop prefetch(a)
  for (int j = 0; j < 4; ++j) {
    f(a[i+j]); !llvm.loop.prefetch

The j-loop might be fully unrolled in which case it the i-loop would become the accesses’ innermost, with a different stride offset. See llvm.loop.parallel_accesses for another metadata that references the loop.

I think I need to redesign this pragma. I intend to perfect this work with the following four pragma.

  1. #pragma clang loop prefetch(disable) LoopDataPrefetch Pass is enabled by default in some CPU architectures, such as aarch64. However, the perfect loop data prefetch policy cannot be always guaranteed. Incorrect prefetch mode may deteriorate program performance. The pragma may be used to disable data prefetching for a specifical loop. I think this will make it easy for developers to do performance debugging.
  2. #pragma clang loop prefetch(enable) The LoopDataPrefetch Pass is allowed to provide loop data prefetch for all loads/stores in the loop as much as possible. The specific prefetch implementation depends on the LoopDataPrefetch capability. The pragma has not yet determined a specific support solution. Maybe I’ll put it at the end for support.
  3. #pragma clang loop prefetch(variable[[, <r/w>, , ]; variable[, …];…]) The variable is a required argument, and the <r/w>, <trip count ahead>, <cache level> arguments are optional. The reason why this pragma is proposed is to take full advantage of the LoopDataPrefetch Pass, On the other hand, it is difficult for the downstream to expand builtin_prefetch, because if the design of builtin_prefetch is different between different compilers, there will be compatibility problems, which will cause the program to fail to compile normally. So for programs that use builtin_prefetch, the migration cost of project developers is very high. I think pragma can well avoid this problem.
  4. #pragma clang loop noprefetch(variable) Provides fine-grained data prefetch control. Currently, at the compiler level, there is no control method that can be specifically used to disable data prefetching of a certain variable. Now, we hope to provide a control method.