[RFC] Loop vectorization for __builtin_prefetch

Introduction

In some cases, a developer inserts __builtin_prefetch into a loop (especially when tuning an HPC application) to make the application faster.
However, at present, inserting __builtin_prefetch into a vectorizable loop prevents that loop from being vectorized.
As a result, prefetching cannot be intentionally inserted into a vectorized loop.

The proposal is to allow a loop to be vectorized even when a __builtin_prefetch is inserted into it.
If you have any requests or comments, please let me know.

Basic Policy

Vectorize only if the target architecture supports prefetch for SIMD (contiguous prefetch with a mask) or gather prefetch.
Vectorize the prefetch by reusing the Load/Store vectorization process (related to VPWidenMemoryInstructionRecipe).
Resolve vectorization of the address computation.
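
Note for the examples below: __builtin_prefetch also accepts optional read/write and locality hint arguments, and the one-argument form used in this RFC defaults to a read prefetch with maximal temporal locality. A minimal sketch of the explicit form (foo_hints is an illustrative name; the mapping of the hints onto the intrinsic operands is our reading of the prototype):

void foo_hints(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    /* rw = 0 (prefetch for read), locality = 3 (high temporal locality);
       in the IR below these presumably surface as the i32 0 and i32 3
       operands of the masked prefetch intrinsic */
    __builtin_prefetch(&(b[i+8]), 0, 3);
  }
}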

Vectorization examples of __builtin_prefetch

Contiguous prefetch usage example (vector.body only)

C source code

void foo(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));
  }
}

IR (only for vector.body)

  %23 = phi i64 [ 0, %10 ], [ %33, %22 ]
  %24 = phi <vscale x 2 x i64> [ %15, %10 ], [ %34, %22 ]
  %25 = getelementptr inbounds double, ptr %0, i64 %23
  %26 = load <vscale x 2 x double>, ptr %25, align 8, !tbaa !6
  %27 = getelementptr inbounds double, ptr %1, i64 %23
  %28 = load <vscale x 2 x double>, ptr %27, align 8, !tbaa !6
  %29 = fadd fast <vscale x 2 x double> %28, %26
  store <vscale x 2 x double> %29, ptr %25, align 8, !tbaa !6
  %30 = extractelement <vscale x 2 x i64> %24, i64 0
  %31 = add nuw nsw i64 %30, 8
  %32 = getelementptr inbounds double, ptr %1, i64 %31
  tail call void @llvm.masked.prefetch.p0.nxv2i1(ptr nonnull %32, i32 8, i32 0, i32 3, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer))
  %33 = add nuw i64 %23, %21
  %34 = add <vscale x 2 x i64> %24, %19
  %35 = icmp eq i64 %33, %14
  br i1 %35, label %36, label %22, !llvm.loop !10
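
Read conceptually, the widened loop performs one vector's worth of work per iteration and issues a single contiguous (masked) prefetch whose address is taken from lane 0 of the vector induction variable plus the original offset of 8 elements. A pseudo-C sketch of this behaviour (foo_widened_view and vl are illustrative names, not compiler output):

/* Pseudo-C view of the vector.body above (a sketch, not compiler output);
   vl stands for the runtime vector length, i.e. vscale x 2 doubles. */
void foo_widened_view(double * restrict a, double * restrict b, int n, int vl){
  int i, l;
  for(i=0; i+vl<=n; i+=vl){
    for(l=0; l<vl; ++l)                /* in the IR: one vector load/fadd/store */
      a[i+l] = a[i+l] + b[i+l];
    __builtin_prefetch(&(b[i+8]));     /* one masked contiguous prefetch per vector
                                          iteration: all-true predicate, address
                                          taken from lane 0 of the vector IV plus 8 */
  }
}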

Gather prefetch usage example

C source code

void foo(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; i+=4){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));
  }
}

IR (only for vector.body)

  %30 = phi i64 [ 0, %12 ], [ %39, %29 ]
  %31 = phi <vscale x 2 x i64> [ %22, %12 ], [ %40, %29 ]
  %32 = getelementptr inbounds double, ptr %0, <vscale x 2 x i64> %31
  %33 = tail call <vscale x 2 x double> @llvm.masked.gather.nxv2f64.nxv2p0(<vscale x 2 x ptr> %32, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x double> poison), !tbaa !6
  %34 = getelementptr inbounds double, ptr %1, <vscale x 2 x i64> %31
  %35 = tail call <vscale x 2 x double> @llvm.masked.gather.nxv2f64.nxv2p0(<vscale x 2 x ptr> %34, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x double> poison), !tbaa !6
  %36 = fadd fast <vscale x 2 x double> %35, %33
  tail call void @llvm.masked.scatter.nxv2f64.nxv2p0(<vscale x 2 x double> %36, <vscale x 2 x ptr> %32, i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer)), !tbaa !6
  %37 = add nuw nsw <vscale x 2 x i64> %31, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 8, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
  %38 = getelementptr inbounds double, ptr %1, <vscale x 2 x i64> %37
  tail call void @llvm.masked.gather.prefetch.nxv2p0(<vscale x 2 x ptr> %38, i32 8, i32 0, i32 3, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer))
  %39 = add nuw i64 %30, %28
  %40 = add <vscale x 2 x i64> %31, %26
  %41 = icmp eq i64 %39, %19
  br i1 %41, label %42, label %29, !llvm.loop !10
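
Here the induction vector holds the strided indices i, i+4, i+8, ..., so the widened prefetch becomes a gather: every lane prefetches its own &b[index + 8]. A pseudo-C sketch (again with illustrative names):

/* Pseudo-C view of the gather-prefetch vector.body above (a sketch);
   vl is the runtime vector length. */
void foo_gather_view(double * restrict a, double * restrict b, int n, int vl){
  int i, l;
  for(i=0; i+4*vl<=n; i+=4*vl){
    for(l=0; l<vl; ++l){
      int idx = i + 4*l;                 /* one lane of the index vector */
      a[idx] = a[idx] + b[idx];          /* masked gather / scatter in the IR */
      __builtin_prefetch(&(b[idx+8]));   /* gather prefetch: one address per lane,
                                            &b[index + 8] for every lane */
    }
  }
}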

Prototype

Phabricator URL:
https://reviews.llvm.org/D156068

Issues

We have the following issues with this prototype. We would like to get your feedback on these.

Choosing prefetch type

Depending on the architecture, there are several different types of prefetch, including scalar prefetch, contiguous prefetch (with a mask), and gather prefetch, and the appropriate one must be chosen. The choice depends on the cost of each type of prefetch and on the data access pattern actually needed in the vectorized loop (scalar, contiguous, or discrete accesses).
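
As an illustration of the selection problem, the address pattern inside the loop largely determines which form applies. The sketch below is hypothetical and not a fixed rule; in particular, treating a loop-invariant address as a candidate for scalar prefetch is an assumption:

void choose_example(double * restrict a, double * restrict b, int n){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    __builtin_prefetch(&(b[i+8]));    /* unit-stride address: contiguous (masked) prefetch */
    __builtin_prefetch(&(b[4*i+8]));  /* strided address: gather prefetch */
    __builtin_prefetch(&(b[n-1]));    /* loop-invariant address: a single scalar
                                         prefetch per vector iteration may be
                                         the cheapest form */
  }
}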

Calculating the vectorization cost

In this prototype, the vectorization cost calculation is adapted from the Load/Store cost calculation. If we want to calculate the vectorization cost of prefetch more accurately, it may be better to implement it separately.

Appropriate address

Unlike Load/Store, a prefetch prepares the necessary data in the cache in advance, so the appropriate address for a vectorized prefetch depends on the element size, the vectorization width, and the cache line size of the data to be prefetched.
In this prototype, the address is computed in the same way as in the Load/Store vectorization process, so the addressing of the vectorized prefetch may not be appropriate for improving cache access efficiency.
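
As a concrete illustration, assuming 8-byte doubles and a 64-byte cache line (both assumptions, not part of the prototype): in the scalar loop, &(b[i+8]) is exactly one cache line ahead, but after vectorization the loop consumes vl elements per iteration while the prefetch distance stays at 8 elements, so the effective lead time shrinks. One possible, purely illustrative adjustment is to scale the distance by the vectorization width:

/* Sketch only: 64-byte cache lines (8 doubles) and the scaled 8*vl distance
   are illustrative assumptions, not what the prototype currently emits. */
void foo_scaled_distance(double * restrict a, double * restrict b, int n, int vl){
  int i;
  for(i=0; i<n; ++i){
    a[i] = a[i] + b[i];
    /* scalar intent: 8 doubles ahead = one 64-byte line ahead;
       after vectorization each iteration consumes vl elements, so keeping
       the same lead time suggests a distance of roughly 8*vl elements */
    __builtin_prefetch(&(b[i + 8*vl]));
  }
}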