LLVM emits code for “memcpy” on ARM as consecutive ldr/str instructions, and a dedicated pass after register allocation then combines them into ldm/stm. But ldm/stm instructions require the registers to be in ascending order, which is often not the case after regalloc, so some ldr/str instructions are left uncombined. For example such code:
I ran several tests, and in every one of them regalloc allocated at least one register out of ascending order.
What are your ideas for overcoming this issue? Maybe LLVM should emit code for “memcpy” directly as ldm/stm, or swap registers before combining ldr/str so that they are in ascending order, or somehow fix the register allocator?
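To make the ordering constraint concrete, here is a minimal sketch in C of a simplified model, not LLVM's actual combining pass. It assumes `regs[i]` is the register number regalloc picked for the word at offset `4*i`, and that a run of loads can be merged only while the register numbers strictly ascend. The function name `count_load_groups` is hypothetical, and real merging has further constraints that this ignores.

```c
#include <stddef.h>

/* Simplified model (an assumption, not the real pass):
 * regs[i] is the register number assigned to the word at offset 4*i.
 * Every maximal run of strictly ascending register numbers can become
 * one ldm (or stays a single ldr if the run has length 1).
 * Returns the number of load instructions left after merging. */
size_t count_load_groups(const unsigned *regs, size_t n)
{
    if (n == 0)
        return 0;

    size_t groups = 1;
    for (size_t i = 1; i < n; ++i) {
        /* Ascending order is broken here, so a new ldm/ldr must start. */
        if (regs[i] <= regs[i - 1])
            ++groups;
    }
    return groups;
}
```

Under this model the assignment r0, r1, r2, r3 collapses into one group, while r2, r3, r12, r1 needs two: a single out-of-order register is enough to split the sequence.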
Hmm, this happens elsewhere as well (x86?). Perhaps what we need is a
switch to disable memset/memcpy lowering?
Are you suggesting that we always call the libc memset/memcpy functions instead of lowering the intrinsic? That doesn't seem like a good idea, because often (especially for small chunks of memory) consecutive ldm/stm instructions are more efficient than a memcpy call.
There seems to be a small misunderstanding. I was writing about the bitcode memcpy intrinsic, not memcpy from libc. This is the intrinsic used in the IR for copying structures, as in my example, and it is the lowering of this intrinsic that has the issue I described on ARM.
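For readers without the original example, here is a minimal sketch in C of the kind of source that typically leads to the intrinsic. The struct `quad` and the function `copy_quad` are hypothetical names chosen for illustration, and this is not the code from my tests. A front end such as Clang commonly represents an aggregate assignment like this as a memcpy intrinsic in the IR, although the exact output depends on the compiler version and options.

```c
#include <stdint.h>

/* Hypothetical 16-byte aggregate, chosen only for illustration. */
struct quad {
    uint32_t a, b, c, d;
};

/* A plain structure assignment with no explicit memcpy call.
 * The front end may express this copy as a memcpy intrinsic in the IR,
 * which the ARM backend then lowers to loads and stores. */
void copy_quad(struct quad *dst, const struct quad *src)
{
    *dst = *src;
}
```

No libc function is named in the source here, so a switch that turns intrinsic lowering into a libc call would change how even such simple assignments are compiled.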
We should handle this better. I’m not sure how to guarantee that we can generate ldm/stm without regalloc support. Our only idea is to teach the new register allocator to do a much better job satisfying register hints. If you’d like to track this, feel free to file a bug.