[RFC] Adding support for #pragma clang loop [no]prefetch()

Background:

Clang currently supports developer-controlled prefetching through __builtin_prefetch(). However, Clang lowers this builtin directly into the llvm.prefetch intrinsic, and the resulting call can interfere with some mid-end optimizations, for example by preventing vectorization. In addition, the prefetch distance given to __builtin_prefetch() is fixed. We therefore propose a prefetch pragma to address these shortcomings. In our design, the prefetch distance can be adjusted dynamically according to the vectorization factor (VF), and the hint has no side effects on other passes. For example:

// prefetch method 2:
// #pragma clang loop prefetch(a, 1, 16)
for (int i = 0; i < n; ++i) {
  b[i] = a[i];
  // Assume this prefetch distance is just what the program needs.
  // prefetch method 1:
  __builtin_prefetch(&a[i+16], 0, 3);
}
  1. Loop with prefetch method 1 after loop vectorization (opt -passes=loop-vectorize ...):

    ...
    
    for.body:                                         ; preds = %for.body.preheader, %for.body
      %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
      %3 = add nuw nsw i64 %indvars.iv, 16
      %arrayidx4 = getelementptr inbounds i32, i32* %a, i64 %3
      %4 = bitcast i32* %arrayidx4 to i8*
      tail call void @llvm.prefetch.p0i8(i8* nonnull %4, i32 0, i32 3, i32 1)
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
      %exitcond.not = icmp eq i64 %indvars.iv.next, %0
      br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10
    }
    
    ...
    

    The intrinsic call (@llvm.prefetch.p0()) prevents loop vectorization.

  2. Loop with prefetch method 2 after loop vectorization and loop data prefetch (opt -passes="loop-vectorize,loop-data-prefetch" ...):

    ...
    
    vector.body:                                      ; preds = %vector.body, %vector.ph
      %indvars.iv = phi i64 [ %indvars.iv.next, %vector.body ], [ 0, %vector.ph ]
      %indvar2 = phi i64 [ %indvar.next3, %vector.body ], [ 0, %vector.ph ]
      %4 = shl nuw nsw i64 %indvar2, 5
      %5 = add nuw nsw i64 %4, 512
      %uglygep4 = getelementptr i8, ptr %a, i64 %5
      %6 = getelementptr inbounds i32, ptr %a, i64 %indvars.iv
      tail call void @llvm.prefetch.p0(ptr %uglygep4, i32 0, i32 3, i32 1)
      %wide.load = load <4 x i32>, ptr %6, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %7 = getelementptr inbounds i32, ptr %6, i64 4
      %wide.load1 = load <4 x i32>, ptr %7, align 4, !tbaa !6, !llvm.loop.prefetch !10
      %8 = getelementptr inbounds i32, ptr %b, i64 %indvars.iv
      store <4 x i32> %wide.load, ptr %8, align 4, !tbaa !6
      %9 = getelementptr inbounds i32, ptr %8, i64 4
      store <4 x i32> %wide.load1, ptr %9, align 4, !tbaa !6
      %indvars.iv.next = add nuw nsw i64 %indvars.iv, 8
      %10 = icmp eq i64 %indvars.iv.next, %3
      %indvar.next3 = add nuw nsw i64 %indvar2, 1
      br i1 %10, label %scalar.ph, label %vector.body, !llvm.loop !11
    
    ...
    
    !10 = distinct !{i1 true, i32 1, i32 16}
    

    With prefetch method 2, the loop is vectorized, and the prefetch distance is scaled from 16 to 128 elements (the 512-byte offset above) according to the VF of loop vectorization: each vector-body iteration processes 8 i32 elements, so 16 iterations ahead becomes 16 * 8 = 128 elements ahead.

We hope to provide a pragma-based data prefetch control method for developers. In the design we have in mind, the hint information provided by the pragma is converted into metadata, so it can be used with little impact on other optimization passes: it has no side effects on vectorization, and the prefetch distance can be adjusted dynamically based on loop iterations. In addition, pragma-based prefetch control is more convenient when migrating projects between compilers.
We hope the pragma-based and builtin-based prefetch controls can be used together to give developers better control over data prefetching.

Proposed Semantics:

  1. #pragma clang loop noprefetch(variable)
    1. variable: a memory reference (the data not to be prefetched); it must be a declared pointer or array variable.
  2. #pragma clang loop prefetch(variable, level, distance)
    1. variable: a memory reference (the data to be prefetched); it must be a declared pointer or array variable.
    2. level: an optional value telling the compiler which cache level to prefetch into. To use this argument, you must also specify variable.
      ‘0’: data will not be reused
      ‘1’: L1 cache
      ‘2’: L2 cache
      ‘3’: L3 cache
    3. distance: an optional integer argument with a value greater than 0. It specifies how many loop iterations ahead of the corresponding load or store instruction the prefetch is issued. To use this argument, you must also specify variable and level. (A combined usage sketch follows this list.)
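
As a sketch of how the argument forms compose (illustrative only; the array names are made up for this example and are not part of the proposal):

#pragma clang loop prefetch(a)         // variable only; level and distance left to the compiler
#pragma clang loop prefetch(b, 2)      // prefetch b into the L2 cache
#pragma clang loop prefetch(c, 1, 16)  // prefetch c into L1, 16 iterations ahead
#pragma clang loop noprefetch(d)       // never prefetch d
for (int i = 0; i < n; ++i) {
  res = a[i] + b[i] + c[i] + d[i];
}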

Rules of pragma:

  1. Support for prefetching multiple variables:
#pragma clang loop prefetch(a)
#pragma clang loop prefetch(b)
#pragma clang loop noprefetch(c)
for (int i = 0; i < n; ++i) {
   // prefetch a and b, noprefetch c.
   res = a[i] + b[i] + c[i] + d[i];
}
  2. Support for nested loops:
#pragma clang loop prefetch(a)
for (int i = 0; i < n; ++i) {
   res += a[i];
   #pragma clang loop prefetch(b)
   for (int j = 0; j < n; ++j) {
     res += b[j];
   }
}

Metadata implementation:

!llvm.loop.prefetch !0
!0 = distinct !{i1 true/false, i32 level, i32 distance}

1st arg: true: prefetch; false: noprefetch
2nd arg: -1: default; 0-3: cache level
3rd arg: -1: default; 1-INT_MAX: iterations ahead

For example, #pragma clang loop prefetch(a, 1, 16) from the motivating example above is encoded as !{i1 true, i32 1, i32 16} (the !10 node shown in the vectorized IR), and #pragma clang loop noprefetch(c) would be encoded as !{i1 false, i32 -1, i32 -1}.

A revision with the implementation can be found in D144377 and D144378.

So my concerns are twofold: first, that this doesn’t model the IR well, and second, that it doesn’t model the builtin accurately.

I don’t particularly like trying to use metadata for something like this, as it tends to be pretty ethereal; I would expect us to use a call to an intrinsic on the variable, like the builtin does. I realize this likely means modeling the pragma as attached to the loop and having it be a property of the loop during IR CodeGen (which I THINK we do elsewhere?). Though I DO have concerns about how this could be modeled for ‘iterations’.

Second, the builtin uses a ‘read/write’ flag, and this proposal doesn’t seem to consider that, while it adds ‘other’ things that aren’t particularly similar. So I have a problem with calling it the same thing if the pragma isn’t going to be the ‘same thing’.

I’m hopeful that when Aaron comes along he can respond to this better, but I’m a little suspicious of pragmas like this (and frankly, of builtins like this) in general; I find them very difficult to use correctly/usefully, and they often silently punish users who don’t do a ‘perfect’ job.

FWIW, that’s been my experience as well. For the folks who know they exist and know how to use them properly, they’re nice enough, but for everyone else they’re a perf footgun because we’re assuming the programmer uses them properly. That doesn’t mean we shouldn’t have the option, but it does make me wonder what evidence there is of a user base for the functionality.

Another question I have is: do we need a pragma at all, or can we modify the existing builtin to have an overloaded form that does what we need?
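
To make the question concrete, here is a purely hypothetical sketch of what an overloaded form could look like; the extra parameter and its meaning are an invention for illustration, not an existing Clang feature:

// Existing form: address, read/write flag (0 or 1), locality (0-3).
__builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/3);

// Hypothetical overload: an iterations-ahead distance that the optimizer
// could rescale after vectorization (not implemented anywhere today).
// __builtin_prefetch(&a[i], /*rw=*/0, /*locality=*/3, /*iters_ahead=*/16);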

I assume this will work with other loop constructs like range-based for loops in C++ and do/while loops? Or is this limited to simple for loops?

It’s probably best to discuss this under a different topic such as ‘IR & Optimizations’ as well, since clang frontend folks don’t always have a good handle on how IR changes like this will hold up through the various passes. The people that could offer specific feedback may not be reading this topic.

Yeah, what you said makes a lot of sense. This pragma was originally designed for experienced developers. The issues you mentioned are very valuable, and I will consider them in depth. Thanks.

Currently, this pragma only works on simple for/while loops. If we need to support range-based for loops, it can be extended.


Perhaps teaching the vectorizer to vectorize in the presence of the prefetch intrinsic would be better, if that’s feasible.

I am now implementing vectorization support for __builtin_prefetch (the prefetch intrinsic) and will post it on Phabricator for review around the end of March 2023.

I am thinking that __builtin_prefetch would be useful for prefetching indirect accesses.
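
As an illustration of that use case (a sketch with made-up names a, idx, n, and DIST; a real loop would also need a bounds check on i + DIST near the end of the loop):

for (int i = 0; i < n; ++i) {
  // Prefetch the element that will be gathered DIST iterations from now;
  // a hardware prefetcher typically cannot predict this address.
  __builtin_prefetch(&a[idx[i + DIST]], /*rw=*/0, /*locality=*/3);
  sum += a[idx[i]];
}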

I see some use of a loop pragma, in a scenario like this:

#pragma clang loop prefetch(<trip count ahead>, <locality>, <cache level>)
for (int i = 0; i < n; ++i) {
  f(a[i] + b[2*i]);
}

where the compiler would insert prefetch instructions for each of the accesses, taking into account each access’s stride and size. That is, a is prefetched sizeof(*a)*<trip count ahead> bytes ahead, and b is prefetched sizeof(*b)*2*<trip count ahead> bytes ahead.

The other way is a builtin where the user has to specify each pointer and the prefetch distance manually, such as:

  __builtin_prefetch(&a[i*<trip count ahead>]);
  __builtin_prefetch(&b[2*i*<trip count ahead>]);
  f(a[i] + b[2*i]);

I think it makes sense to have both. Maybe the compiler could derive the <cache level> and <locality> arguments automatically from the stride and cache sizes, so there is less that the user could do wrong?

The proposal here seems to lie between those, where the user has to specify each variable separately in a pragma. For such cases I would also prefer using the builtin, since it is not too hard to compute the byte offset manually when you want more fine-grained control than just prefetch everything. The LoopVectorize pass should be able to handle prefetch intrinsics.

If using an intrinsic/metadata relative to a loop, it should reference the loop it is referring to. This is because instructions can be hoisted/sunk from/into loops or the loop unrolled.

for (int i = 0; i < n; ++i) {
  #pragma clang loop prefetch(a)
  for (int j = 0; j < 4; ++j) {
    f(a[i+j]); !llvm.loop.prefetch
  }
}

The j-loop might be fully unrolled, in which case the i-loop would become the accesses’ innermost loop, with a different stride. See llvm.loop.parallel_accesses for another metadata that references the loop.

I think I need to redesign this pragma. I intend to continue this work with the following four pragmas (a combined usage sketch follows the list).

  1. #pragma clang loop prefetch(disable): The LoopDataPrefetch pass is enabled by default on some CPU architectures, such as AArch64, but a perfect loop data prefetch policy cannot always be guaranteed, and an ill-chosen prefetch may degrade program performance. This pragma can be used to disable data prefetching for a specific loop, which I think will make performance debugging easier for developers.
  2. #pragma clang loop prefetch(enable): The LoopDataPrefetch pass is allowed to prefetch data for all loads/stores in the loop as aggressively as possible. The concrete prefetches depend on the LoopDataPrefetch pass’s capability. I have not yet settled on a specific design for this form; I may leave it for last.
  3. #pragma clang loop prefetch(variable[, <r/w>[, <trip count ahead>[, <cache level>]]][; variable[, …]; …]): The variable is a required argument, and the <r/w>, <trip count ahead>, and <cache level> arguments are optional. One reason for this form is to take full advantage of the LoopDataPrefetch pass. Another is that it is difficult for downstream users to extend __builtin_prefetch: if its design differs between compilers, there are compatibility problems and programs may fail to compile, so the migration cost for projects that use __builtin_prefetch is very high. I think a pragma avoids this problem.
  4. #pragma clang loop noprefetch(variable): Provides fine-grained data prefetch control. Currently there is no compiler-level mechanism to disable data prefetching for a specific variable; we hope to provide one.
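
A combined sketch of how the four forms might look in source, assuming the syntax above (variable names are illustrative, and the spelling of the optional <r/w> token is not fixed by this proposal):

#pragma clang loop prefetch(disable)   // form 1: no data prefetching for this loop
for (int i = 0; i < n; ++i)
  out0[i] = in0[i];

#pragma clang loop prefetch(enable)    // form 2: let LoopDataPrefetch prefetch optimistically
for (int i = 0; i < n; ++i)
  out1[i] = in1[i * s];

#pragma clang loop prefetch(a, r, 16, 1)  // form 3: read-prefetch a, 16 iterations ahead, into L1 ("r" is a placeholder)
#pragma clang loop noprefetch(c)          // form 4: never prefetch c
for (int i = 0; i < n; ++i)
  out2[i] = a[i] + c[i];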

I don’t think a pragma instead of a builtin solves this problem. Like the builtin, the pragma is a compiler extension that is not implemented by all compilers. If a compiler does not support a pragma, it typically prints a warning, but programmers can also stub out the builtin using #define __builtin_prefetch(...), and a well-maintained source would try to avoid the pragma warning anyway. At least __builtin_prefetch is supported by both Clang and GCC, with Clang’s version intended to have the same semantics as GCC’s.

I recognize that the difference from __builtin_prefetch is under discussion.

How about having the pragma suggest to the compiler that cache misses occur in the loop? In other words, that the loop contains access patterns that cannot be covered by the hardware prefetcher. Such a hint is meaningful because it is often difficult to determine at compile time whether the hardware prefetcher will fail, for example with a large-stride access where the stride is not a compile-time constant.

A typical use case would then be to identify loops where cache misses occur from a runtime profile and annotate them. For users without knowledge of the prefetcher’s behavior, __builtin_prefetch cannot solve the problem because they do not know which accesses should be prefetched.

A feature could be added to the prefetch pass that inserts prefetches for accesses that may miss the cache at runtime, i.e., accesses that the target processor’s hardware prefetcher may not be able to handle, whenever the hint is given.

@Meinersbur Thank you for the discussion.
Here is an example of the behavior of the above proposal:

#pragma clang loop prefetch(enable)  // Telling that cache misses occurred
for (i=0; i<n; i++) {
  // Prefetch instructions are not necessary for stream accesses
  // because it is sufficient in many cases with a hardware prefetcher.
  a0[i] = ...
  a1[i] = ...
  a2[i] = ...

  // Nonconstant stride accesses are likely to be the cause of cache misses, 
  // so prefetch instructions are inserted.
  b[i*s] = ...
}

However, if the loop trip count is small and the hardware prefetcher does not warm up, cache misses may occur even with stream accesses. It may be helpful to survey actual programs and consider which kinds of hints and prefetch targets are useful.

If you already know what data you want to prefetch, __builtin_prefetch is as powerful as the pragma. In your example

for (i=0; i<n; i++) {
  a0[i] = ...
  a1[i] = ...
  a2[i] = ...

  __builtin_prefetch(&b[(i+1)*s], 0, 3); // Prefetch the next iteration
  b[i*s] = ...
}

and there is not much difference to the originally proposed #pragma clang loop prefetch(b, 1, 1) other than syntax.

I think the original motivation was to insert the prefetch intrinsics after LoopVectorize, which cannot handle them. IMHO that should be fixed instead.

The other use case is “I want the data accessed in this loop to be prefetched”. I think here the loop pragma would be useful, where the compiler can apply heuristics, based on a hardware model, to decide which data makes sense to prefetch.

LoopDataPrefetch already does this, e.g. it has isStrideLargeEnough, which would exclude the a[i] cases (depending on what size a[i] is). However, it is conservative in its profitability heuristic:

  const auto *ConstStride = dyn_cast<SCEVConstant>(AR->getStepRecurrence(*SE));
  // If MinStride is set, don't prefetch unless we can ensure that stride is
  // larger.
  if (!ConstStride)
    return false;

The pragma could reverse the heuristic from conservative to optimistic. That’s essentially what #pragma clang loop vectorize(enable) and unroll(enable) also do.

Implementation-wise, this only needs a single metadata node such as !{ !"llvm.loop.prefetch.enable", i1 true } attached to the loop; there is no need to encode which variable. Look at how LoopVectorize’s LoopVectorizeHints handles this. However, please don’t replicate the idea of having a Hints class; it’s clunky, and we have convenience accessors for metadata in LoopUtils.h.
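
A minimal sketch of what reversing the heuristic might look like, assuming the "llvm.loop.prefetch.enable" metadata name suggested above and the getBooleanLoopAttribute accessor from LoopUtils.h; this is not the actual patch:

  // Sketch only: a loop-level hint flips the conservative default.  L is the
  // Loop being processed; the metadata name and accessor are assumptions
  // based on the suggestions in this thread.
  const bool ForcedByPragma =
      getBooleanLoopAttribute(L, "llvm.loop.prefetch.enable");

  const auto *ConstStride = dyn_cast<SCEVConstant>(AR->getStepRecurrence(*SE));
  // Without the hint, keep bailing out on non-constant strides; with the
  // hint, prefetch optimistically.
  if (!ConstStride && !ForcedByPragma)
    return false;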

If at some point you want more control over a specific access (e.g. excluding a specific variable from being prefetched), we could add that at a later point; it is a lot more difficult to pass via metadata.

Thanks for everyone’s input. For #pragma clang loop prefetch(variable[,...]), there are indeed some things I had not considered. A pragma is well suited to migrating a project between compilers, since developers do not need to modify the code, but this form lacks a clear advantage over __builtin_prefetch, so I want to put it on hold. However, I think #pragma clang loop prefetch(enable/disable) is still useful. I plan to improve this pragma and submit the code as soon as possible.