RFC: Code Prefetch Insertion

Thanks for sharing the data. Any insights on the L1 icache miss increase?

This optimization would also add more instructions. What is the total instruction count increase? What is the overall effect (i.e., an application-level metric) of the increased instruction count combined with the improved IPC?

Correct. The prefetchi* instruction targets the L2 cache. It’s not documented, but it’s not hard to verify with microbenchmarks. The increase in L1 icache misses is not from those instructions per se, as it’s not reproducible when they are replaced by nops of the same size. We believe it’s due to evictions from the L2, which subsequently cause evictions in the L1 icache (due to cache inclusivity).
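To make the suspected mechanism concrete, here is a toy model of an inclusive two-level hierarchy. It is only a sketch under loud assumptions: both levels are fully associative with LRU replacement, L1 hits are invisible to the L2, and the sizes and line names are made up. None of this describes the real hardware; it only shows how fills into the L2 can push a hot line out of the L1.

```python
from collections import OrderedDict


class InclusiveHierarchy:
    """Toy two-level cache: fully associative, LRU, L2 inclusive of L1.

    Assumption: L1 hits do not refresh the L2's LRU state, so a line that
    is hot in L1 can still look cold to the L2.
    """

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1_misses = 0

    def _fill_l2(self, line):
        if line in self.l2:
            self.l2.move_to_end(line)
            return
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            # Inclusivity: a line leaving the L2 must also leave the L1.
            self.l1.pop(victim, None)
        self.l2[line] = True

    def prefetch_l2(self, line):
        """Model a code prefetch that fills the L2 only."""
        self._fill_l2(line)

    def fetch(self, line):
        """Model an instruction fetch through the L1."""
        if line in self.l1:
            self.l1.move_to_end(line)
            return
        self.l1_misses += 1
        self._fill_l2(line)
        if len(self.l1) >= self.l1_lines:
            self.l1.popitem(last=False)
        self.l1[line] = True


def run(with_prefetch, iterations=10):
    """Run a hot loop over lines A and B, optionally prefetching new lines."""
    h = InclusiveHierarchy(l1_lines=2, l2_lines=4)
    for i in range(iterations):
        h.fetch("A")
        h.fetch("B")
        if with_prefetch:
            h.prefetch_l2(f"P{i}")  # hypothetical prefetched line
    return h.l1_misses
```

In this model the loop without prefetches only takes its two cold misses, while the run with prefetches takes additional L1 misses even though nothing is ever prefetched into the L1 directly.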

Yes. The intended pass could work for CSPGO as well. An interface can be defined to match with block IDs. The callsite index would be the same.

The full paths in LBR profiles are used to guide the prefetchit placement. So doing this solely based on the compiler’s edge and block profile is infeasible.
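A small illustration of why edge counts alone lose the needed information. The CFG, the counts, and the "useful fraction" metric are hypothetical and are not claimed to be what the profile conversion tool computes; the sketch only shows that two very different path profiles can collapse to the same edge profile.

```python
from collections import Counter


def edge_counts(paths):
    """Collapse a {path: count} profile into per-edge counts."""
    edges = Counter()
    for path, count in paths.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


def useful_fraction(paths, site, target):
    """Fraction of executions of `site` that later reach `target`."""
    through = sum(c for p, c in paths.items() if site in p)
    reaching = sum(
        c for p, c in paths.items()
        if site in p and target in p[p.index(site) + 1:]
    )
    return reaching / through


# Hypothetical CFG: A or B -> M -> X or Y.
# Profile 1: the branch at M is fully correlated with the predecessor.
correlated = {("A", "M", "X"): 100, ("B", "M", "Y"): 100}
# Profile 2: the branch at M is independent of the predecessor.
uncorrelated = {
    ("A", "M", "X"): 50,
    ("A", "M", "Y"): 50,
    ("B", "M", "X"): 50,
    ("B", "M", "Y"): 50,
}
```

Both profiles yield identical edge counts, yet a prefetch of X placed in A would always be useful under the first profile and only half the time under the second. Only the path data can tell the two apart.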

There will be separate efforts to open-source/upstream that part as well.

I’m trying to understand how this RFC relates to upstream LLVM. Propeller is an independent project. Does this RFC require changes only to Propeller, or also to upstream LLVM?

The upstream changes needed are:

  1. Generating symbols for prefetch targets: https://github.com/llvm/llvm-project/pull/168439
  2. Inserting prefetch instructions at requested positions: X86: Add prefetch insertion based on Propeller profile by rlavaee · Pull Request #166324 · llvm/llvm-project · GitHub

Both of these need the mapping data from SHT_LLVM_BB_ADDR_MAP. We are planning to extend the mapping capability to AFDO as well next year.

Further to what @rlavaee said, we are working on porting the Propeller profile conversion tool from GitHub to LLVM. @jinhuang1102 is working on a proposal for this.

Thanks for the references, that clarifies things. The original RFC also mentioned that linker changes may be needed. Is there a patch for that?


cc @MaskRay as this proposal seems to be doing some unusual things with symbols.

Thanks for the reminder. Here is the linker change: Resolve undefined prefetch targets to zero to effectively prefetch the next instruction. by rlavaee · Pull Request #174448 · llvm/llvm-project · GitHub

Thanks for notifying me.

Hardcoding symbol names for symbol resolution and relocation processing is definitely not right.

The PC-relative prefetchit1 uses an R_X86_64_PC32 relocation. When the referenced symbol is undefined, the linker reports an error in -shared and -pie links.

Instead, handle this in the compiler. Before emitting the prefetch, if the target symbol isn’t defined in the current module, emit a weak fallback:

prefetchit1 __llvm_prefetch_target_foo(%rip)
.weak __llvm_prefetch_target_foo
__llvm_prefetch_target_foo:

When __llvm_prefetch_target_foo is defined elsewhere, emit it as STB_GLOBAL — the strong definition will override any weak ones.
This way, stale profiles gracefully degrade to prefetching the next instruction without requiring linker changes.
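As a rough sketch of the emission logic described above (not the actual LLVM implementation), the following models the compiler-side decision as text generation, plus a toy version of strong-over-weak resolution. The helper names, the `fallbacks_emitted` bookkeeping, and the addresses are assumptions made for illustration; only the `__llvm_prefetch_target_` prefix and the assembly shape come from this thread.

```python
PREFIX = "__llvm_prefetch_target_"


def emit_prefetch(target, defined_here, fallbacks_emitted):
    """Return assembly lines for one prefetch of `target`.

    defined_here: symbols with a real definition in the current module.
    fallbacks_emitted: weak fallbacks already emitted in this module, so
    the same label is never defined twice (hypothetical bookkeeping).
    """
    sym = PREFIX + target
    lines = [f"\tprefetchit1 {sym}(%rip)"]
    if sym not in defined_here and sym not in fallbacks_emitted:
        fallbacks_emitted.add(sym)
        # Weak fallback: the label sits right after the prefetch, so a
        # stale target degrades to prefetching the next instruction.
        lines += [f"\t.weak {sym}", f"{sym}:"]
    return lines


def emit_target_definition(target):
    """Return assembly lines for the real target, emitted as STB_GLOBAL."""
    sym = PREFIX + target
    return [f"\t.globl {sym}", f"{sym}:"]


def resolve(definitions):
    """Toy symbol resolution over (binding, address) pairs.

    A strong (GLOBAL) definition overrides weak ones. With only weak
    definitions, this model keeps the first one it sees.
    """
    strong = [addr for binding, addr in definitions if binding == "GLOBAL"]
    if len(strong) > 1:
        raise ValueError("duplicate strong definitions")
    if strong:
        return strong[0]
    weak = [addr for binding, addr in definitions if binding == "WEAK"]
    return weak[0] if weak else None
```

For example, `emit_prefetch("foo", set(), set())` yields the three lines shown earlier, while passing a `defined_here` set that already contains `__llvm_prefetch_target_foo` yields only the prefetch itself.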

If you need semantics beyond what weak definitions provide, the path forward would be proposing a new relocation type on the x86-64 ABI list.

Thanks @MaskRay for the great idea. I will implement this soon.