[RFC] Universal Profile-Guided Automatic Software Loop Data Prefetcher for LLVM
Full design document
Initial related LLVM PR: PR #166553
Summary
This RFC proposes upstreaming the current implementation of a profile-guided automatic software loop data prefetcher for LLVM. The implementation is an LLVM function pass for AArch64 platforms that support SPE, consumes collected load profiles, and addresses our current motivating workloads and engineering needs.
The prefetcher is called universal because the complete design is not restricted to a fixed set of address-expression templates. Its target model is any load whose memory address can be computed ahead of time, regardless of whether the address-producing slice contains multiple PHI nodes, control flow, recurrences such as pointer-chasing loops, inner computations, or cross-call dependencies. The current implementation covers a practical subset of this design; the additional full-design scope is summarized near the end of this RFC and described in detail in the linked document.
The core of the framework — the per-loop pipeline, dependency DAG, distance scheduling, profitability and overhead budgeting, and hoisting of original instructions — is architecture-neutral and can be reused as-is on other targets. Porting to a new target still requires target-specific work: wiring up an equivalent precise event-based sampling source (for example, PEBS on x86) as a profile provider, tuning the heuristics for the target’s microarchitectural characteristics, and configuring the prefetch safety machinery to the instruction set actually available on that target.
The proposed prefetcher includes the existing LoopDataPrefetch functionality as a subset: it covers SCEV-expressible single-induction-variable access patterns and can be configured to behave equivalently. The current implementation already goes beyond that subset with profile-driven per-load distance selection, dependency-aware scheduling, safety handling, overhead budgeting, and hoisting of original instructions. The full design further broadens the supported address-computation space, as summarized below. The intended end state is for the new prefetcher to subsume and replace LoopDataPrefetch.
Motivation
Memory latency remains a major performance limiter. Loads that miss in the last-level cache, or incur expensive address-translation behavior such as DTLB misses, can stall dependent execution for hundreds of cycles. These delinquent loads are often irregular and data-dependent, which makes them difficult for hardware prefetchers and existing compiler prefetch passes to handle.
LLVM’s existing LoopDataPrefetch pass is fully static. It targets affine, single-induction-variable address expressions expressible by SCEV as {base, +, stride}, and its prefetch distance comes from a target-provided constant rather than from workload behavior.
Recent academic systems extend the scope but remain narrow. APT-GET is profile-guided, but every PHI in the address slice must be a loop induction variable, so non-IV recurrences such as pointer-chasing loops are out of scope. RPG2 recognizes only three fixed address pattern templates tied to an inner/outer induction-variable structure. In both systems, analysis and placement are confined to the two innermost loop levels, and outer-loop hoisting is limited to a single immediately enclosing loop.
The proposed framework addresses these limitations with a profile-guided, latency-ranked, safety-aware, and overhead-aware prefetching pipeline.
Profiling
The current implementation uses two profile inputs:
- a load profile, collected through precise event-based sampling, describing the memory behavior of individual load instructions;
- an optional branch profile, used to estimate loop trip counts, per-load hotness relative to the loop header, and average instruction count per loop iteration.
Load-profile collection is integrated into llvm-profgen. Load-latency information is extracted from SPE profiling samples, recorded in an extended profile format, and attached to LLVM IR load instructions through MD_prof.
The optional branch profile may be collected together with the load profile or in a separate profiling run. It is used to estimate the loop trip count, the load hotness relative to the loop header, and the average instruction count per loop iteration; when absent, the pass falls back to static estimates.
Current Implementation Overview
The current prefetcher is an LLVM function pass for AArch64 platforms that support SPE.
For every loop, the current implementation performs the following high-level pipeline:
- Identify candidate delinquent loads from the load profile.
- Deduplicate loads that access the same cache line.
- Compute the address-producing predecessor slice for each candidate.
- Build a dependency DAG between prefetches when one prefetch’s address computation depends on another prefetched load.
- Compute prefetch distances recursively over that DAG.
- Apply per-load profitability and global overhead budgeting.
- Generate prefetch code by cloning the address computation and hoisting it a selected number of loop iterations ahead, so it prefetches data for a future iteration.
- Apply safety mechanisms for potentially unsafe speculative execution.
- When profitable, hoist original instructions to replace duplicated cloned computation with pipelined original computation communicated through ring buffers.
Ring buffers are used when an original loop instruction is hoisted into an earlier prefetch pipeline stage. The moved instruction produces a value for a future iteration, while its original consumers still execute later. A small per-value circular buffer stores the produced values until the corresponding later iteration consumes them; indexing is derived from the loop induction variable and wraps with a power-of-two mask. A prologue initializes the entries needed by the first steady-state iterations.
Safety
Prefetch address computation executes ahead of its original program point, so it can encounter inputs that the original program would not have produced at that point. The framework handles this explicitly instead of assuming that generated prefetch code is always safe.
The safety logic focuses on two main sources of risk:
- post-exit iteration execution, where prefetch code computes values for iterations beyond the loop exit;
- store-load aliasing, where an early load in the prefetch address computation may observe a value different from the original execution order.
The current implementation provides operand sanitization and bypass branches. It can also use target-provided non-faulting load variants as a safety mechanism; on AArch64, this is implemented with ldnf1d.
Performance Notes
The current performance tuning was done on NVIDIA Grace CPU systems, based on Arm Neoverse V2 cores. Software prefetching is highly sensitive to microarchitectural details, so thresholds and conclusions may be CPU-specific.
On internal workloads, the current implementation has shown speedups of up to +56%. Among open and publicly known benchmarks, representative results obtained on serial (single-threaded) benchmark versions include (links to the benchmark sources are provided in the full design document):
- graph500/bfs: +31%
- gapbs/sssp: +11%
- gapbs/bc: +10%
- hashjoin-ph-2: +22%
These benchmarks cover different forms of indirect memory access patterns.
The loads that benefited most consistently were those with frequent DRAM or DTLB misses. A key limiting case is when the original demand misses are already too frequent: they can saturate memory-system throughput and leave little room for software prefetches, which are often deprioritized or dropped in such situations. Another limiting factor was small loop trip count, especially because promotion of prefetches into enclosing loops is not yet implemented. This is a trade-off rather than a hard cutoff; we observed useful cases around trip count 32.
Proposed Patch Plan
We currently expect the work to be split into the following patch sequence. This sequence describes the proposed review order, not the current internal implementation status; some later patches correspond to functionality that already exists internally but will be posted separately to keep reviews manageable.
- ExtBinary format extension. Extend the ExtBinary format used by the profile toolchain. This work is already ongoing: PR #166553. Review comments are welcome.
- AArch64 SPE branch profiling. Add SPE-based branch profiling and enable an
llvm-profgen-based profiling toolchain for AArch64. - AArch64 SPE load profiling. Add SPE-based load profiling and load-profile propagation into LLVM IR.
- Basic Universal Loop Data Prefetcher. Land a simplified reviewable subset of the pass without non-trivial CFG support, memory aliasing checks, non-faulting loads, hoisting of original instructions, or recurrence support.
- Non-trivial CFG support. Add support for address computations involving non-trivial control flow.
- Memory aliasing checks. Add memory alias analysis and checks needed by the prefetch safety machinery.
- Non-faulting load support. Add support for using non-faulting load variants in prefetch address computation where the target supports them.
- Recurrence support. Add support for non-IV recurrences in address computations.
- Hoisting of original instructions. Add hoisting of original instructions and the corresponding ring-buffer pipeline machinery.
Full Design Extensions
The full design is structured to accommodate several extensions beyond the current implementation. These extensions are not required for the initial upstream version and can be implemented incrementally when there is sufficient motivation:
- direct latency sampling as an additional profiling mode on targets where accurate load-latency samples are available;
- loop preheader prefetching for early iterations;
- recursive promotion of prefetches through enclosing loops;
- function-call handling, including partial inlining of callee address-computation slices;
- loop versioning for pre-loop safety checks;
- richer control-flow support in address computations, including
switchinstructions and inner loops; - inter-stage clone deduplication for non-relocatable instructions.
Request for Guidance
We would appreciate community input on the following items:
1. Community interest. Does the community see this direction - a profile-guided, latency-ranked, safety-aware universal software prefetcher - as a useful direction for LLVM to pursue upstream?
2. Patch-series ordering. Is the patch sequence above reviewable as proposed, or should some pieces be split further, reordered, or grouped differently?
3. Pass placement. Our current proposal is to run the new pass from the same pipeline location that currently invokes LoopDataPrefetch; this placement works for both the regular optimization pipeline and LTO. We would still appreciate feedback on this choice.
4. IPO ownership. The complete design includes function-call handling, where address-computation slices may cross call boundaries or require partial inlining of the relevant callee slice. The current implementation does not include this piece yet. Should the pass source live under IPO from the beginning to reflect this future direction, or should the initial upstream version remain organized as a function pass and move toward IPO only when function-call handling is implemented?
5. Replacement plan for LoopDataPrefetch. The new prefetcher is intended to subsume and replace LoopDataPrefetch, with a compatibility mode that mimics the existing static SCEV-based behavior. We would appreciate guidance on transition strategy, default enablement policy, and criteria for retiring or disabling the old pass.
6. Cross-cutting concerns. Anything we should pre-empt or discuss before posting patches rather than discovering only during review.
Thanks in advance for any guidance the community can offer.