[RFC] Loop vectorization for __builtin_prefetch

Introduction

In some cases, a developer inserts __builtin_prefetch into a loop (especially when tuning an HPC application) to make the application faster.
However, at present, inserting __builtin_prefetch into a vectorizable loop prevents that loop from being vectorized.
As a result, prefetching cannot be intentionally inserted into a vectorized loop.

The proposal is to allow a loop to be vectorized even when a __builtin_prefetch is inserted into it.
If you have any requests or comments, please let me know.

Basic Policy

Vectorize only if the target architecture supports prefetch for SIMD (contiguous prefetch with a mask) or gather prefetch.
Vectorize the prefetch by reusing the Load/Store vectorization process (related to VPWidenMemoryInstructionRecipe).
Resolve vectorization of the address computation.
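
Note for the examples below: __builtin_prefetch also accepts optional read/write and locality hint arguments, and the one-argument form used in this RFC defaults to a read prefetch with maximal temporal locality. A minimal sketch of the explicit form (foo_hints is an illustrative name; the mapping of the hints onto the intrinsic operands is our reading of the prototype):

void foo_hints(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    /* rw = 0 (prefetch for read), locality = 3 (high temporal locality);
       in the IR below these presumably surface as the i32 0 and i32 3
       operands of the masked prefetch intrinsic */
    __builtin_prefetch(&(b[i+8]), 0, 3);
  }
}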

Vectorization examples of __builtin_prefetch

Contiguous prefetch usage example (vector.body only)

C source code

void foo(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));
  }
}

IR (only for vector.body)

  %23 = phi i64 [ 0, %10 ], [ %33, %22 ]
  %24 = phi <vscale x 2 x i64> [ %15, %10 ], [ %34, %22 ]
  %25 = getelementptr inbounds double, ptr %0, i64 %23
  %26 = load <vscale x 2 x double>, ptr %25, align 8, !tbaa !6
  %27 = getelementptr inbounds double, ptr %1, i64 %23
  %28 = load <vscale x 2 x double>, ptr %27, align 8, !tbaa !6
  %29 = fadd fast <vscale x 2 x double> %28, %26
  store <vscale x 2 x double> %29, ptr %25, align 8, !tbaa !6
  %30 = extractelement <vscale x 2 x i64> %24, i64 0
  %31 = add nuw nsw i64 %30, 8
  %32 = getelementptr inbounds double, ptr %1, i64 %31
  tail call void @llvm.masked.prefetch.p0.nxv2i1(ptr nonnull %32, i32 8, i32 0, i32 3, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer))
  %33 = add nuw i64 %23, %21
  %34 = add <vscale x 2 x i64> %24, %19
  %35 = icmp eq i64 %33, %14
  br i1 %35, label %36, label %22, !llvm.loop !10
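
Read conceptually, the widened loop performs one vector's worth of work per iteration and issues a single contiguous (masked) prefetch whose address is taken from lane 0 of the vector induction variable plus the original offset of 8 elements. A pseudo-C sketch of this behaviour (foo_widened_view and vl are illustrative names, not compiler output):

/* Pseudo-C view of the vector.body above (a sketch, not compiler output);
   vl stands for the runtime vector length, i.e. vscale x 2 doubles. */
void foo_widened_view(double * restrict a, double * restrict b, int n, int vl){
  int i, l;
  for(i=0; i+vl<=n; i+=vl){
    for(l=0; l<vl; ++l)                /* in the IR: one vector load/fadd/store */
      a[i+l] = a[i+l] + b[i+l];
    __builtin_prefetch(&(b[i+8]));     /* one masked contiguous prefetch per vector
                                          iteration: all-true predicate, address
                                          taken from lane 0 of the vector IV plus 8 */
  }
}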

Gather prefetch usage example

C source code

void foo(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; i+=4){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));
  }
}

IR (only for vector.body)

  %30 = phi i64 [ 0, %12 ], [ %39, %29 ]
  %31 = phi <vscale x 2 x i64> [ %22, %12 ], [ %40, %29 ]
  %32 = getelementptr inbounds double, ptr %0, <vscale x 2 x i64> %31
  %33 = tail call <vscale x 2 x double> @llvm.masked.gather.nxv2f64.nxv2p0(<vscale x 2 x ptr> %32, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x double> poison), !tbaa !6
  %34 = getelementptr inbounds double, ptr %1, <vscale x 2 x i64> %31
  %35 = tail call <vscale x 2 x double> @llvm.masked.gather.nxv2f64.nxv2p0(<vscale x 2 x ptr> %34, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x double> poison), !tbaa !6
  %36 = fadd fast <vscale x 2 x double> %35, %33
  tail call void @llvm.masked.scatter.nxv2f64.nxv2p0(<vscale x 2 x double> %36, <vscale x 2 x ptr> %32, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)), !tbaa !6
  %37 = add nuw nsw <vscale x 2 x i64> %31, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 8, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
  %38 = getelementptr inbounds double, ptr %1, <vscale x 2 x i64> %37
  tail call void @llvm.masked.gather.prefetch.nxv2p0(<vscale x 2 x ptr> %38, i32 8, i32 0, i32 3, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer))
  %39 = add nuw i64 %30, %28
  %40 = add <vscale x 2 x i64> %31, %26
  %41 = icmp eq i64 %39, %19
  br i1 %41, label %42, label %29, !llvm.loop !10
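
Here the induction vector holds the strided indices i, i+4, i+8, ..., so the widened prefetch becomes a gather: every lane prefetches its own &b[index + 8]. A pseudo-C sketch (again with illustrative names):

/* Pseudo-C view of the gather-prefetch vector.body above (a sketch);
   vl is the runtime vector length. */
void foo_gather_view(double * restrict a, double * restrict b, int n, int vl){
  int i, l;
  for(i=0; i+4*vl<=n; i+=4*vl){
    for(l=0; l<vl; ++l){
      int idx = i + 4*l;                 /* one lane of the index vector */
      a[idx] = a[idx] + b[idx];          /* masked gather / scatter in the IR */
      __builtin_prefetch(&(b[idx+8]));   /* gather prefetch: one address per lane,
                                            &b[index + 8] for every lane */
    }
  }
}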

Prototype

Phabricator URL:
https://reviews.llvm.org/D156068

Issues

We have the following issues with this prototype. We would like to get your feedback on these.

Choosing prefetch type

Depending on the architecture, there are several different types of prefetch, including scalar prefetch, contiguous prefetch (with a mask), and gather prefetch, and the appropriate one must be chosen. The choice depends on the cost of each type of prefetch and on the data access pattern actually needed in the vectorized loop (scalar, contiguous, or discrete accesses).
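
As an illustration of the selection problem, the address pattern inside the loop largely determines which form applies. The sketch below is hypothetical and not a fixed rule; in particular, treating a loop-invariant address as a candidate for scalar prefetch is an assumption:

void choose_example(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));    /* unit-stride address: contiguous (masked) prefetch */
    __builtin_prefetch(&(b[4*i+8]));  /* strided address: gather prefetch */
    __builtin_prefetch(&(b[n-1]));    /* loop-invariant address: a single scalar
                                         prefetch per vector iteration may be
                                         the cheapest form */
  }
}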

Calculating the vectorization cost

In this prototype, the vectorization cost calculation is adapted from the Load/Store cost calculation. If we want to calculate the vectorization cost of prefetch more accurately, it may be better to implement it separately.

Appropriate address

Unlike Load/Store, a prefetch prepares the necessary data in the cache in advance, so the appropriate address for a vectorized prefetch depends on the element size, the vectorization width, and the cache line size of the data to be prefetched.
In this prototype, the address is computed in the same way as in the Load/Store vectorization process, so the addressing of the vectorized prefetch may not be appropriate for improving cache access efficiency.
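
As a concrete illustration, assuming 8-byte doubles and a 64-byte cache line (both assumptions, not part of the prototype): in the scalar loop, &(b[i+8]) is exactly one cache line ahead, but after vectorization the loop consumes vl elements per iteration while the prefetch distance stays at 8 elements, so the effective lead time shrinks. One possible, purely illustrative adjustment is to scale the distance by the vectorization width:

/* Sketch only: 64-byte cache lines (8 doubles) and the scaled 8*vl distance
   are illustrative assumptions, not what the prototype currently emits. */
void foo_scaled_distance(double * restrict a, double * restrict b, int n, int vl){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    /* scalar intent: 8 doubles ahead = one 64-byte line ahead;
       after vectorization each iteration consumes vl elements, so keeping
       the same lead time suggests a distance of roughly 8*vl elements */
    __builtin_prefetch(&(b[i + 8*vl]));
  }
}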