Memory Prefetching Support in LLVM

I am curious to understand if LLVM does enough to insert prefetching instructions while generating code. To the best of my understanding, most backends expose prefetch instructions via intrinsics and let the user handle them. However, theoretically, compiler can analyze the IR and insert the prefetch instructions to make the cache line available.

While I understand that software prefetching has its pros and cons, I am curious if LLVM had any RFC or if anyone has tried this before for X86 or other backends.

(In the modern architectures which have smarter hardware prefetchers, it probably doesn’t make sense to invest in software prefetching but some articles like [1] support the idea of covering broader access patterns in the compiler and meticulously handling them in the compiler.)

Thoughts?

Thanks,
Madhur

[1] Prefetching

For compiler inserted prefetching, you can refer to this paper: https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

The insertion pass is not upstreamed, but available here: GitHub - SabaJamilan/Profile-Guided-Software-Prefetching

The autoFDO based prefetching inserting infrastructure is in LLVM: ⚙ D54052 Support for inserting profile-directed cache prefetches

Select (sub)targets support LoopDataPrefetch pass, https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Scalar/LoopDataPrefetch.cpp

To see typical examples, see the tests under https://github.com/llvm/llvm-project/tree/main/llvm/test/Transforms/LoopDataPrefetch

Note the required configuration including CacheLineSize and PrefetchDistance for the relevant subtargets, e.g., for A64FX, https://github.com/llvm/llvm-project/blob/514381840c6d7aa775a092556992c87f022a361f/llvm/lib/Target/AArch64/AArch64Subtarget.cpp#L161

One thing to add about patterns, I tried (cca 2019) the technique in [1]: if you have vectors of pointers, you can use the induction variable to predict where future cache misses will happen. It works well on microbenchmarks. It’s attractive to datacenter workloads because (no surprise) cache misses happen a lot. We targeted it, there, at places where we actually see data cache misses (from PMU data).

It didn’t materialize.

One reason is that (and it’s so evident in hindsight!) loops in data center apps have low trip count. Not surprising (the hindsight…) - you shard data and parallelize, rather than spin over it locally. The optimization really has value when the trip count is sufficiently large, much larger than the lookahead, so as to amortize the cost of the extra compute (the lookahead + the load) by the savings from avoiding future cache misses.

A second reason is that the prefetch code pattern that needs inserting needs to perform a load (you need to load vec[i+lookahead], so you need to make sure the load can happen safely. At minimum that means i+lookahead < vec.size() but even that has hidden surprises. One is that vec.size() may not be provably constant (because alias analysis fails to help prove vec isn’t mutated in the loop), so you now have a costly extra boundary check (or you choose to bail out of the optimization here). We could warn the developer to see if they can reformulate the loop (i.e. rewrite as for (auto i = 0, e = vec.size(); i < e; ++i))… and there were other traps: lots of code likes to fast-exit the loop (again, no surprise in hindsight - the goal is finishing a request fast) - so then you pay for the extra lookahead for nothing (or, again, bail out of the optimization); in some cases, the exit condition is predicated on data you discover in the loop; etc.

Long story short, at least for our kinds of workloads, the short trip count, coupled with the “tricky” real-world loops like the ones I described, made the optimization impractical.

What we learned:

  • some of the things I mention as “in hindsight obvious” weren’t immediately obvious (esp. the “nature of loops where this could work”), and actually doing the experiment was valuable in uncovering these details
  • it’d be so nice if we had an indirect prefetch! but we don’t. This isn’t a new idea, afaik - it’d be like a prefetch_after_load <address>, that would not fault if <address> couldn’t be loaded. This hypothetical instruction would remove any need for expensive boundary checks - it’d probably invite its own other problems, like mem bandwidth pressure, so maybe we’re doing a “grass is greener on the other side” fallacy here.

Maybe we gave up too early? Hard to tell. My hope here is not to discourage but to set expectations about, or scope, some paths that, to us, seemed initially attractive. Maybe there are other unexplored paths that we discouraged ourselves out of too early, that others can try and succeed at?

[1] https://www.cl.cam.ac.uk/~sa614/papers/Software-Prefetching-CGO2017.pdf

1 Like