Thanks for sharing the data. Any insights on the L1 icache miss increase?
This optimization would also add more instructions. What is the total instruction count increase? What is the overall effect (i.e., on an application-level metric) of the increased instruction count combined with the improved IPC?
Correct. The prefetchi* instruction targets the L2 cache. It's not documented, but it's not hard to verify with microbenchmarks. The increase in L1 icache misses is not from those instructions per se, as it's not reproducible when they are replaced by nops of the same size. We believe it's due to evictions from the L2, which subsequently cause evictions in the L1 icache (due to cache inclusivity).
Yes. The intended pass could work for CSPGO as well. An interface can be defined to match with block IDs. The callsite index would be the same.
The full paths in LBR profiles are used to guide the prefetchit placement. So doing this solely based on the compiler's edge and block profile is infeasible.
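To illustrate why edge and block counts alone are not enough, here is a minimal Python sketch. The CFG, block names, and counts are hypothetical, and this is not the placement algorithm from the RFC. It only shows that two different path profiles can collapse to identical edge counts while disagreeing on what executes after a given block.

```python
from collections import Counter


def edge_counts(paths):
    """Collapse full paths into per-edge execution counts."""
    counts = Counter()
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[(src, dst)] += 1
    return counts


def prob_reaches(paths, start, target):
    """Fraction of paths starting at `start` that later execute `target`."""
    from_start = [p for p in paths if p[0] == start]
    return sum(target in p for p in from_start) / len(from_start)


# Hypothetical CFG: A or B flows into M, which flows into C or D.
# Profile 1: the branch taken at M is fully determined by the predecessor.
correlated = [("A", "M", "C")] * 50 + [("B", "M", "D")] * 50
# Profile 2: the branch taken at M is independent of the predecessor.
uncorrelated = (
    [("A", "M", "C")] * 25 + [("A", "M", "D")] * 25
    + [("B", "M", "C")] * 25 + [("B", "M", "D")] * 25
)

# Both profiles produce exactly the same edge counts...
same_edges = edge_counts(correlated) == edge_counts(uncorrelated)
# ...but they disagree on whether a prefetch for D issued in A is useful.
p_correlated = prob_reaches(correlated, "A", "D")
p_uncorrelated = prob_reaches(uncorrelated, "A", "D")
```

In the first profile a prefetch for D placed in A would never be useful, while in the second it would be useful half the time. An edge profile cannot tell these two cases apart.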
There will be separate efforts to open-source/upstream that part as well.
I'm trying to understand how this RFC relates to upstream LLVM. Propeller is an independent project. Does this RFC require changes only to Propeller, or also to upstream LLVM?
Further to what @rlavaee said, we are working on porting the Propeller profile conversion tool from GitHub to LLVM. @jinhuang1102 is working on a proposal for this.
Hardcoding symbol names for symbol resolution and relocation processing is definitely not right.
The PC-relative prefetchit1 uses an R_X86_64_PC32 relocation. When the referenced symbol is undefined, the linker reports an error in -shared and -pie links.
Instead, handle this in the compiler. Before emitting the prefetch, if the target symbol isn't defined in the current module, emit a weak fallback definition of it at the instruction that follows the prefetch.
When __llvm_prefetch_target_foo is defined elsewhere, emit it as STB_GLOBAL; the strong definition will override any weak ones.
This way, stale profiles gracefully degrade to prefetching the next instruction without requiring linker changes.
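The resolution rule this relies on can be modeled with a toy Python sketch. It is not a real linker: the `Definition` type, the `resolve` helper, and the object file names are hypothetical, and only the weak versus global binding rule is modeled.

```python
from dataclasses import dataclass

WEAK, GLOBAL = "STB_WEAK", "STB_GLOBAL"


@dataclass(frozen=True)
class Definition:
    name: str
    binding: str  # WEAK or GLOBAL
    origin: str   # object file providing the definition (hypothetical)


def resolve(name, definitions):
    """Pick the definition a linker would bind `name` to (simplified)."""
    candidates = [d for d in definitions if d.name == name]
    strong = [d for d in candidates if d.binding == GLOBAL]
    if len(strong) > 1:
        raise ValueError(f"duplicate symbol: {name}")
    if strong:
        return strong[0]        # a strong definition overrides weak ones
    if candidates:
        return candidates[0]    # only weak definitions: the first one is used
    raise LookupError(f"undefined symbol: {name}")


SYM = "__llvm_prefetch_target_foo"
weak_fallback = Definition(SYM, WEAK, "caller.o")    # label after the prefetch
strong_target = Definition(SYM, GLOBAL, "foo.o")     # the real prefetch target

# Fresh profile: the target exists, so the prefetch points into foo.o.
fresh = resolve(SYM, [weak_fallback, strong_target])
# Stale profile: the target is gone, so the weak fallback in caller.o is used.
stale = resolve(SYM, [weak_fallback])
```

With only the weak definition present, the reference still resolves, so the link succeeds and the prefetch lands on the fallback label instead of producing an undefined-symbol error.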
If you need semantics beyond what weak definitions provide, the path forward would be proposing a new relocation type on the x86-64 ABI list.