RFC: Memcpy inlining in IR

Hi all,

For GlobalISel, we’re exploring options for implementing inlining optimizations for memcpy and friends. However, looking around the existing implementation, I don’t see anything that would make it particularly problematic for us to do this at the IR level.

The existing TLI hooks that specify how certain memcpy calls should be lowered don’t have anything very SelectionDAG-specific, and an IR lowering pass could be shared in the future between SDAG and GISel. Does anyone see issues with this?
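As a rough illustration of the kind of decision such an IR pass would make from target hooks, here is a standalone C++ sketch. For a memcpy with a constant size and known alignment, it picks a sequence of load/store widths, widest first, and gives up (keeping the libcall) when more stores than a target limit would be needed. `MemcpyLoweringInfo` and `planMemcpy` are hypothetical names, and the greedy strategy is only loosely modeled on SelectionDAG's existing lowering, not a transcription of it.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for the per-target knobs that TLI
// exposes for memcpy lowering. These names are illustrative, not LLVM's API.
struct MemcpyLoweringInfo {
  unsigned MaxStores;      // above this many stores, keep the memcpy call
  unsigned MaxAccessBytes; // widest legal load/store, a power of two
  bool AllowMisaligned;    // target tolerates wide accesses below natural alignment
};

// Returns the widths (in bytes) of the load/store pairs to emit, in order,
// or std::nullopt if the copy should remain a library call.
std::optional<std::vector<unsigned>>
planMemcpy(std::uint64_t Size, std::uint64_t Align,
           const MemcpyLoweringInfo &Info) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 &&
         "alignment must be a power of two");
  assert(Info.MaxAccessBytes != 0 &&
         (Info.MaxAccessBytes & (Info.MaxAccessBytes - 1)) == 0 &&
         "access width must be a power of two");
  unsigned Width = Info.MaxAccessBytes;
  // Without misaligned-access support, never go wider than the known alignment.
  if (!Info.AllowMisaligned && Align < Width)
    Width = static_cast<unsigned>(Align);

  std::vector<unsigned> Ops;
  std::uint64_t Remaining = Size;
  while (Remaining != 0) {
    // Shrink to the largest power of two that still fits in what is left.
    while (Width > Remaining)
      Width /= 2;
    Ops.push_back(Width);
    Remaining -= Width;
    if (Ops.size() > Info.MaxStores)
      return std::nullopt; // too many stores: not worth inlining
  }
  return Ops;
}
```

For example, with an 8-byte maximum access width and a limit of four stores, a 15-byte copy with 8-byte alignment becomes 8 + 4 + 2 + 1, while the same copy with only 2-byte alignment would need eight stores and stays a call.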

Thanks,
Amara

Sounds similar to D60318 ([ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline).
It should be done *really* late in the middle-end pipeline, though.

Roman.

There seem to be a lot of opinions on where memcpy expansion/inlining should happen: in (late) IR, or in the backend; see, for example, https://reviews.llvm.org/D35035. A complicating factor is that efficient memcpy lowering is crucial for both performance and code size, and memcpy calls occur very frequently.

Either way, I agree that the TLI hooks are not SelectionDAG-specific; they can be used in an IR lowering pass.

Cheers,
Sjoerd.

I agree that this should be a very late pass. Doing it in IR would simplify the implementation in GlobalISel, and it would also potentially allow us to have one shared expansion/optimization pass for both SDISel and GISel.

Volkan may look at upstreaming a partial implementation he has downstream.

Cheers,
Amara

We already have lib/Transforms/Utils/LowerMemIntrinsics.cpp; there just isn’t a general pass that expands these for targets. AMDGPU already always uses this for memcpy handling.
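For reference, a loop-based expansion has roughly the shape below: a loop that moves one wide chunk per iteration, followed by a residual copy of the leftover bytes. This is a conceptual C++ rendering of what such emitted IR does, not the code in LowerMemIntrinsics.cpp; `copyLoopThenResidual` is a hypothetical name, and the `ChunkT` parameter stands in for whatever operand type a target would choose for the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Conceptual shape of a loop expansion: a main loop doing one
// ChunkT-sized load/store pair per iteration, then a residual tail.
template <typename ChunkT>
void copyLoopThenResidual(unsigned char *Dst, const unsigned char *Src,
                          std::size_t Size) {
  assert(Dst != nullptr && Src != nullptr);
  constexpr std::size_t W = sizeof(ChunkT);
  const std::size_t LoopBytes = Size - Size % W;

  // Main loop: one wide "load" into a temporary, one wide "store" out of it.
  for (std::size_t I = 0; I < LoopBytes; I += W) {
    ChunkT Tmp;
    std::memcpy(&Tmp, Src + I, W);
    std::memcpy(Dst + I, &Tmp, W);
  }

  // Residual: leftover bytes copied one at a time. A real lowering could
  // use progressively narrower accesses here instead.
  for (std::size_t I = LoopBytes; I < Size; ++I)
    Dst[I] = Src[I];
}
```

The point of the split is that the loop trip count can be computed from the size, while the tail handles sizes that are not a multiple of the chunk width; emitting this requires inserting control flow, which is straightforward in IR.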

-Matt

Sure, that might end up sharing some code, but the key thing is to use the TLI hooks to implement the same optimizations that SelectionDAG currently performs.

Amara

For CHERI, we have to be quite careful with memcpy, because any pointer copy must be done with pointer load/store operations for every pointer-sized and pointer-aligned location. I believe I’ve now found four (maybe five?) different places in LLVM where memcpy is expanded. Most of those are in the IR, not in SelectionDAG.
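A toy model may help show why the expansion strategy matters here. Each pointer-sized, pointer-aligned slot carries a validity tag; moving a slot with a single pointer-width load/store keeps the tag, while any byte-wise store into the slot clears it. The 16-byte slot size, `TaggedMemory`, and `copyTagged` are assumptions made for illustration, not CHERI's actual ISA or LLVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed pointer (capability) size for this toy model only.
constexpr std::size_t PtrBytes = 16;

// Toy tagged memory: one validity tag per PtrBytes-aligned slot.
struct TaggedMemory {
  std::vector<std::uint8_t> Bytes;
  std::vector<bool> Tags;

  explicit TaggedMemory(std::size_t Size)
      : Bytes(Size, 0), Tags((Size + PtrBytes - 1) / PtrBytes, false) {}

  // A plain byte store invalidates the tag of the slot it writes into.
  void storeByte(std::size_t Addr, std::uint8_t V) {
    Bytes[Addr] = V;
    Tags[Addr / PtrBytes] = false;
  }
};

// Copies Size bytes from Src at SrcOff to Dst at DstOff. If CapabilityAware,
// every whole slot that is pointer-aligned in both source and destination is
// moved with one pointer-width "load/store" that carries the tag along.
void copyTagged(TaggedMemory &Dst, std::size_t DstOff, const TaggedMemory &Src,
                std::size_t SrcOff, std::size_t Size, bool CapabilityAware) {
  assert(SrcOff + Size <= Src.Bytes.size() && DstOff + Size <= Dst.Bytes.size());
  std::size_t I = 0;
  while (I < Size) {
    const std::size_t S = SrcOff + I, D = DstOff + I;
    if (CapabilityAware && S % PtrBytes == 0 && D % PtrBytes == 0 &&
        Size - I >= PtrBytes) {
      for (std::size_t K = 0; K < PtrBytes; ++K)
        Dst.Bytes[D + K] = Src.Bytes[S + K];
      Dst.Tags[D / PtrBytes] = Src.Tags[S / PtrBytes]; // tag preserved
      I += PtrBytes;
    } else {
      Dst.storeByte(D, Src.Bytes[S]); // destination slot's tag is cleared
      ++I;
    }
  }
}
```

A naive byte-by-byte expansion copies the same bytes but loses the tag on the slot at offset 16, so a pointer stored there would no longer be usable; the capability-aware expansion keeps it.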

If you’re expanding to loads and stores, it’s much better to do this at the IR level than in SelectionDAG, because in IR you can insert control-flow structures (so you can emit loops); I don’t believe this is a problem for GlobalISel. Some targets don’t expand to loads and stores: for example, on some x86 variants memcpy is expanded to a single REP MOVSB instruction. This is probably much easier to implement as a DAG pattern.

David

Maybe I’m wrong, but it seems to me that as soon as there’s a memcpy on a pointer, the IR is incorrect for your target’s semantics. That is, the language frontend should never emit a memcpy on pointers, and it would be wrong for your target to synthesize a new memcpy from loads/stores of pointers. Clang definitely has the information required to respect these constraints, but I don’t think LLVM IR does.