Memory Prefetching Support in LLVM

I am curious to understand whether LLVM does enough to insert prefetch instructions while generating code. To the best of my understanding, most backends expose prefetch instructions via intrinsics and leave their use to the programmer. In theory, however, the compiler could analyze the IR and insert prefetch instructions itself so that cache lines are available by the time they are needed.
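For context, the user-facing route looks roughly like the sketch below: __builtin_prefetch lowers to the llvm.prefetch IR intrinsic, and the programmer picks which access to prefetch and how far ahead to do it (the 16-element lookahead here is purely illustrative):

  #include <cstddef>

  // Manual software prefetching via the builtin that backends expose today.
  long sum_with_manual_prefetch(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
      if (i + 16 < n)
        // rw = 0 (prefetch for a read), locality = 3 (keep in all cache levels).
        __builtin_prefetch(&data[i + 16], /*rw=*/0, /*locality=*/3);
      sum += data[i];
    }
    return sum;
  }

The question above is essentially whether the compiler could place such calls itself instead of leaving this to the programmer.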

While I understand that software prefetching has its pros and cons, I am curious whether there has ever been an RFC for this in LLVM, or whether anyone has tried it before for X86 or other backends.

(On modern architectures with smarter hardware prefetchers it probably doesn’t make sense to invest in software prefetching, but some articles like [1] support the idea of covering broader access patterns and handling them meticulously in the compiler.)

Thoughts?

Thanks,
Madhur

[1] Prefetching

For compiler-inserted prefetching, you can refer to this paper: https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

The insertion pass is not upstreamed, but is available here: https://github.com/SabaJamilan/Profile-Guided-Software-Prefetching

The AutoFDO-based prefetch insertion infrastructure is in LLVM: https://reviews.llvm.org/D54052 ("Support for inserting profile-directed cache prefetches").

Select (sub)targets support the LoopDataPrefetch pass: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp

For typical examples, see the tests under https://github.com/llvm/llvm-project/tree/main/llvm/test/Transforms/LoopDataPrefetch

Note the required configuration, including CacheLineSize and PrefetchDistance, for the relevant subtargets; e.g., for A64FX: https://github.com/llvm/llvm-project/blob/514381840c6d7aa775a092556992c87f022a361f/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L161
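As a rough sketch of what that configuration amounts to (field names mirror the subtarget properties set in the linked AArch64 code; the defaults and the gating logic below are a deliberate simplification of what the pass does on IR, and none of the numbers are the real A64FX values):

  // The per-subtarget knobs that LoopDataPrefetch reads through
  // TargetTransformInfo.
  struct PrefetchTuning {
    unsigned CacheLineSize = 64;               // bytes per cache line
    unsigned PrefetchDistance = 0;             // 0 effectively disables the pass
    unsigned MinPrefetchStride = 1;            // smallest stride worth prefetching
    unsigned MaxPrefetchIterationsAhead = ~0u; // cap on lookahead, in iterations
  };

  // Simplified gating decision for one strided access: prefetch only if the
  // subtarget supplied a distance and the stride is large enough that the
  // hardware prefetcher is unlikely to cover it already.
  inline bool shouldPrefetch(const PrefetchTuning &T, unsigned StrideInBytes) {
    return T.PrefetchDistance > 0 && StrideInBytes >= T.MinPrefetchStride;
  }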

One thing to add about patterns: I tried the technique in [1] (circa 2019). If you have vectors of pointers, you can use the induction variable to predict where future cache misses will happen. It works well on microbenchmarks, and it’s attractive for datacenter workloads because (no surprise) cache misses happen a lot. We targeted it there at places where we actually saw data cache misses (from PMU data).

It didn’t materialize.

One reason is that (and it’s so evident in hindsight!) loops in data center apps have low trip counts. Not surprising, in hindsight: you shard data and parallelize rather than spin over it locally. The optimization really has value only when the trip count is sufficiently large, much larger than the lookahead, so that the savings from avoided cache misses amortize the cost of the extra compute (the lookahead plus the load).
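To put a rough number on that amortization argument (the notation is mine, not from the original post): with trip count N, lookahead L, an extra per-iteration cost C for the lookahead address computation and load, and a saving S per avoided miss, the transformation only wins when roughly (N - L) * S > N * C, because the first L iterations cannot be covered by any prefetch. For the short trip counts described above, that inequality fails quickly.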

A second reason is that the prefetch code pattern that needs inserting has to perform a load (you need to load vec[i+lookahead]), so you need to make sure that load can happen safely. At minimum that means i+lookahead < vec.size(), but even that has hidden surprises. One is that vec.size() may not be provably loop-invariant (because alias analysis fails to prove that vec isn’t mutated in the loop), so you now have a costly extra bounds check (or you choose to bail out of the optimization here). We could warn the developer to see if they can reformulate the loop (i.e. rewrite it as for (auto i = 0, e = vec.size(); i < e; ++i))… and there were other traps: lots of code likes to fast-exit the loop (again, no surprise in hindsight; the goal is finishing a request fast), so then you pay for the extra lookahead for nothing (or, again, bail out of the optimization); in some cases the exit condition is predicated on data you only discover inside the loop; etc.
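A minimal sketch of the pattern being described (the container types, the lookahead value, and the bounds check are mine, purely illustrative):

  #include <cstddef>
  #include <vector>

  // vec holds indices into data; the expensive miss is on data[vec[i]], so the
  // prefetch itself must first load vec[i + lookahead].
  long process(const std::vector<int> &vec, const int *data) {
    constexpr size_t lookahead = 64;                    // needs tuning
    long sum = 0;
    for (size_t i = 0, e = vec.size(); i < e; ++i) {
      if (i + lookahead < e)                            // the extra bounds check
        __builtin_prefetch(&data[vec[i + lookahead]]);  // the extra load + prefetch
      sum += data[vec[i]];
    }
    return sum;
  }

The branch and the load of vec[i + lookahead] are exactly the per-iteration overhead that a short trip count cannot amortize.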

Long story short, at least for our kinds of workloads, the short trip count, coupled with the “tricky” real-world loops like the ones I described, made the optimization impractical.

What we learned:

  • some of the things I mention as “in hindsight obvious” weren’t immediately obvious (esp. the “nature of loops where this could work”), and actually doing the experiment was valuable in uncovering these details
  • it’d be so nice if we had an indirect prefetch, but we don’t. This isn’t a new idea, afaik; it’d be something like a prefetch_after_load <address> that would not fault if <address> couldn’t be loaded. Such a hypothetical instruction would remove any need for expensive bounds checks. It’d probably invite its own problems, like memory bandwidth pressure, so maybe we’re falling for a “grass is greener on the other side” fallacy here.

Maybe we gave up too early? Hard to tell. My hope here is not to discourage, but to set expectations about, or scope, some paths that initially seemed attractive to us. Maybe there are other paths we talked ourselves out of too early that others can try and succeed at?

[1] https://www.cl.cam.ac.uk/~sa614/papers/Software-Prefetching-CGO2017.pdf


Thanks @davidxl, @MattPD and @mtrofin for sharing the insights and pointers.
This is indeed a difficult problem, and I appreciate you sharing your experiences implementing it.


There is also the situation where prefetching can screw up the sequential consistency requirements that atomic operations need.

@Mitch_Alsup could you elaborate?

For compiler-inserted prefetching, you can refer to this paper: https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

The insertion pass is not upstreamed, but is available here: https://github.com/SabaJamilan/Profile-Guided-Software-Prefetching

BTW, are there people out there who have succeeded in using this pass and reproducing the results from the paper?

The performance claims in the paper are impressive; but to be honest, after several hours of debugging the scripts and looking into the code, I realized that I would have to invest a lot of time just to make it workable. Upstreaming is another story; right now the code is a “quick student job” in the worst possible sense, and it would have to be completely rewritten just to be considered for upstreaming.

I will probably stop here (as I have better uses for my time), but if someone already has a working version and can share it, that would be appreciated. The paper refers to “Google data centers”; it is also referenced in Google’s slides from “Practical Compiler Optimizations for WSC Workshop @ US LLVM Dev Meeting 2023”, so maybe the Google folks have something they can share? @dhoekwater ?

Yours,
Andrey
===
Advanced Software Technology Lab
Huawei

Consider (a lousy example):

 ...
 atomic {
      i = array[k];        // dependent load inside the atomic region
      matrix[i][k] = x;    // store whose address depends on that load
 }

You definitely cannot move the LD of i outside of the atomic block.

And you really don’t want to move any prefetch of i outside of the atomic block even if you leave the actual LD of i inside the block: external observers would see the prefetch out of sequentially consistent order with respect to the source code.
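In the same pseudocode style, the transformation being warned against would look something like this (a sketch of the hoist, not anything an existing pass emits):

 prefetch(&array[k]);     // WRONG: the access to array[k] becomes visible
 ...                      // to other agents before the atomic region begins
 atomic {
      i = array[k];
      matrix[i][k] = x;
 }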

But how would an external observer actually observe the prefetched value?

Sadly, I don’t have a working version to share (though I’m not personally super familiar with that part of Google’s optimization stack, so maybe someone else can chime in here).

I referenced that paper in the instructor notes to convey that Google uses selective software prefetching (to improve dcache performance while limiting memory bandwidth impact); I listed the APT-GET paper as an example of current research, but I don’t know of any current efforts to upstream this functionality.

Got it. Thanks for the clarification!

We’re looking at deleting two of the passes mentioned in this thread: https://github.com/llvm/llvm-project/pull/167317 (“[X86] Delete Profile Guided Prefetch Passes”). There is currently no way to generate profiles usable by these passes, and there are no users known to us. If someone is experimenting with them, I’m happy to keep them in tree, but otherwise I’ll land the patch early next week.
