Memory Prefetching Support in LLVM

I am curious to understand whether LLVM does enough to insert prefetch instructions while generating code. To the best of my understanding, most backends expose prefetch instructions via intrinsics and leave their use to the programmer. In theory, however, the compiler could analyze the IR and insert prefetch instructions itself so that cache lines are available by the time they are needed.
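For context, the user-facing route looks roughly like the sketch below: __builtin_prefetch lowers to the llvm.prefetch IR intrinsic, and the programmer picks which access to prefetch and how far ahead to do it (the 16-element lookahead here is purely illustrative):

  #include <cstddef>

  // Manual software prefetching via the builtin that backends expose today.
  long sum_with_manual_prefetch(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
      if (i + 16 < n)
        // rw = 0 (prefetch for a read), locality = 3 (keep in all cache levels).
        __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/3);
      sum += data[i];
    }
    return sum;
  }

The question above is essentially whether the compiler could place such calls itself instead of leaving this to the programmer.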

While I understand that software prefetching has its pros and cons, I am curious whether there has ever been an RFC for this in LLVM, or whether anyone has tried it before for X86 or other backends.

(On modern architectures with smarter hardware prefetchers it probably doesn’t make sense to invest in software prefetching, but some articles like [1] support the idea of covering broader access patterns and handling them meticulously in the compiler.)

Thoughts?

Thanks,
Madhur

[1] Prefetching

For compiler-inserted prefetching, you can refer to this paper: https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

The insertion pass is not upstreamed, but is available here: https://github.com/SabaJamilan/Profile-Guided-Software-Prefetching

The AutoFDO-based prefetch insertion infrastructure is in LLVM: https://reviews.llvm.org/D54052 ("Support for inserting profile-directed cache prefetches").

Select (sub)targets support the LoopDataPrefetch pass: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp

For typical examples, see the tests under https://github.com/llvm/llvm-project/tree/main/llvm/test/Transforms/LoopDataPrefetch

Note the required configuration, including CacheLineSize and PrefetchDistance, for the relevant subtargets; e.g., for A64FX: https://github.com/llvm/llvm-project/blob/514381840c6d7aa775a092556992c87f022a361f/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L161
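As a rough sketch of what that configuration amounts to (field names mirror the subtarget properties set in the linked AArch64 code; the defaults and the gating logic below are a deliberate simplification of what the pass does on IR, and none of the numbers are the real A64FX values):

  // The per-subtarget knobs that LoopDataPrefetch reads through
  // TargetTransformInfo.
  struct PrefetchTuning {
    unsigned CacheLineSize = 64;               // bytes per cache line
    unsigned PrefetchDistance = 0;             // 0 effectively disables the pass
    unsigned MinPrefetchStride = 1;            // smallest stride worth prefetching
    unsigned MaxPrefetchIterationsAhead = ~0u; // cap on lookahead, in iterations
  };

  // Simplified gating decision for one strided access: prefetch only if the
  // subtarget supplied a distance and the stride is large enough that the
  // hardware prefetcher is unlikely to cover it already.
  inline bool shouldPrefetch(const PrefetchTuning &T, unsigned StrideInBytes) {
    return T.PrefetchDistance > 0 && StrideInBytes >= T.MinPrefetchStride;
  }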

One thing to add about patterns: I tried the technique in [1] (circa 2019). If you have vectors of pointers, you can use the induction variable to predict where future cache misses will happen. It works well on microbenchmarks, and it’s attractive for datacenter workloads because (no surprise) cache misses happen a lot. We targeted it there at places where we actually saw data cache misses (from PMU data).

It didn’t materialize.

One reason is that (and it’s so evident in hindsight!) loops in data center apps have low trip counts. Not surprising, in hindsight: you shard data and parallelize rather than spin over it locally. The optimization really has value only when the trip count is sufficiently large, much larger than the lookahead, so that the savings from avoided cache misses amortize the cost of the extra compute (the lookahead plus the load).
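To put a rough number on that amortization argument (the notation is mine, not from the original post): with trip count N, lookahead L, an extra per-iteration cost C for the lookahead address computation and load, and a saving S per avoided miss, the transformation only wins when roughly (N - L) * S > N * C, because the first L iterations cannot be covered by any prefetch. For the short trip counts described above, that inequality fails quickly.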

A second reason is that the prefetch code pattern that needs inserting has to perform a load (you need to load vec[i+lookahead]), so you need to make sure that load can happen safely. At minimum that means i+lookahead < vec.size(), but even that has hidden surprises. One is that vec.size() may not be provably loop-invariant (because alias analysis fails to prove that vec isn’t mutated in the loop), so you now have a costly extra bounds check (or you choose to bail out of the optimization here). We could warn the developer to see if they can reformulate the loop (i.e. rewrite it as for (auto i = 0, e = vec.size(); i < e; ++i))… and there were other traps: lots of code likes to fast-exit the loop (again, no surprise in hindsight; the goal is finishing a request fast), so then you pay for the extra lookahead for nothing (or, again, bail out of the optimization); in some cases the exit condition is predicated on data you only discover inside the loop; etc.
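A minimal sketch of the pattern being described (the container types, the lookahead value, and the bounds check are mine, purely illustrative):

  #include <cstddef>
  #include <vector>

  // vec holds indices into data; the expensive miss is on data[vec[i]], so the
  // prefetch itself must first load vec[i + lookahead].
  long process(const std::vector<int> &vec, const int *data) {
    constexpr size_t lookahead = 64;                    // needs tuning
    long sum = 0;
    for (size_t i = 0, e = vec.size(); i < e; ++i) {
      if (i + lookahead < e)                            // the extra bounds check
        __builtin_prefetch(&data[vec[i + lookahead]]);  // the extra load + prefetch
      sum += data[vec[i]];
    }
    return sum;
  }

The branch and the load of vec[i + lookahead] are exactly the per-iteration overhead that a short trip count cannot amortize.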

Long story short, at least for our kinds of workloads, the short trip count, coupled with the “tricky” real-world loops like the ones I described, made the optimization impractical.

What we learned:

  • some of the things I mention as “in hindsight obvious” weren’t immediately obvious (esp. the “nature of loops where this could work”), and actually doing the experiment was valuable in uncovering these details
  • it’d be so nice if we had an indirect prefetch, but we don’t. This isn’t a new idea, afaik; it’d be something like a prefetch_after_load <address> that would not fault if <address> couldn’t be loaded. Such a hypothetical instruction would remove any need for expensive bounds checks. It’d probably invite its own problems, like memory bandwidth pressure, so maybe we’re falling for a “grass is greener on the other side” fallacy here.

Maybe we gave up too early? Hard to tell. My hope here is not to discourage, but to set expectations about, or scope, some paths that initially seemed attractive to us. Maybe there are other paths we talked ourselves out of too early that others can try and succeed at?

[1] https://www.cl.cam.ac.uk/~sa614/papers/Software-Prefetching-CGO2017.pdf


Thanks @davidxl, @MattPD and @mtrofin for sharing the insights and pointers.
This is indeed a difficult problem, and I appreciate you sharing your experiences implementing it.


There is also the situation where prefetching can screw up the sequential consistency requirements that atomic operations need.

@Mitch_Alsup could you elaborate?

For compiler-inserted prefetching, you can refer to this paper: https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

The insertion pass is not upstreamed, but is available here: https://github.com/SabaJamilan/Profile-Guided-Software-Prefetching

BTW, are there people out there who have succeeded in using this pass and reproducing the results from the paper?

The performance claims in the paper are impressive; but to be honest, after several hours of debugging the scripts and looking into the code, I realized that I would have to invest a lot of time just to make it workable. Upstreaming is another story; right now the code is a “quick student job” in the worst possible sense, and it would have to be completely rewritten just to be considered for upstreaming.

I will probably stop here (as I have better uses for my time), but if someone already has a working version and can share it, that would be appreciated. The paper refers to “Google data centers”; it is also referenced in Google’s slides from “Practical Compiler Optimizations for WSC Workshop @ US LLVM Dev Meeting 2023”, so maybe the Google folks have something they can share? @dhoekwater ?

Yours,
Andrey
===
Advanced Software Technology Lab
Huawei

Consider (a lousy example):

 ...
 atomic {
      i = array[k];        // dependent load inside the atomic region
      matrix[i][k] = x;    // store whose address depends on that load
 }

You definitely cannot move the LD of i outside of the atomic block.

And you really don’t want to move any prefetch of i outside of the atomic block even if you leave the actual LD of i inside the block: external observers would see the prefetch out of sequentially consistent order with respect to the source code.
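In the same pseudocode style, the transformation being warned against would look something like this (a sketch of the hoist, not anything an existing pass emits):

 prefetch(&array[k]);     // WRONG: the access to array[k] becomes visible
 ...                      // to other agents before the atomic region begins
 atomic {
      i = array[k];
      matrix[i][k] = x;
 }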

But how would an external observer actually observe the prefetched value?

Sadly, I don’t have a working version to share (though I’m not personally super familiar with that part of Google’s optimization stack, so maybe someone else can chime in here).

I referenced that paper in the instructor notes to convey that Google uses selective software prefetching (to improve dcache performance while limiting memory bandwidth impact); I listed the APT-GET paper as an example of current research, but I don’t know of any current efforts to upstream this functionality.

Got it. Thanks for the clarification!

We’re looking at deleting two of the passes mentioned in this thread: https://github.com/llvm/llvm-project/pull/167317 (“[X86] Delete Profile Guided Prefetch Passes”). There is currently no way to generate profiles usable by these passes, and there are no users known to us. If someone is experimenting with them, I’m happy to keep them in tree, but otherwise I’ll land the patch early next week.
