memcpy() not completely optimized away on Mac OS


Here is an example (Compiler Explorer) of memcpy() not getting optimized away completely.

Targeting Linux results in the following assembly, which is good:

shuffle32b(unsigned char*, unsigned char const*):
        vmovdqu (%rsi), %ymm0
        vpshufb .LCPI0_0(%rip), %ymm0, %ymm0    # ymm0 = ymm0[0,2,4,6,1,3,5,7,8,10,12,14,9,11,13,15,16,18,20,22,17,19,21,23,24,26,28,30,25,27,29,31]
        vmovdqu %ymm0, (%rdi)

However, when targeting Mac OS, I get the following assembly output:

       0:	55 	pushq	%rbp
       1:	48 89 e5 	movq	%rsp, %rbp
       4:	48 83 e4 e0 	andq	$-32, %rsp
       8:	48 83 ec 40 	subq	$64, %rsp
       c:	c5 fe 6f 06 	vmovdqu	(%rsi), %ymm0
      10:	c5 fd 7f 04 24 	vmovdqa	%ymm0, (%rsp)
      15:	c4 e2 7d 00 05 22 00 00 00 	vpshufb	34(%rip), %ymm0, %ymm0
      1e:	c5 fd 7f 04 24 	vmovdqa	%ymm0, (%rsp)
      23:	c5 fe 7f 07 	vmovdqu	%ymm0, (%rdi)
      27:	48 89 ec 	movq	%rbp, %rsp
      2a:	5d 	popq	%rbp
      2b:	c5 f8 77 	vzeroupper
      2e:	c3 	retq

At offsets 10 and 1e, %ymm0 is spilled onto the stack for no apparent reason, which also causes all the extra instructions for aligning the stack frame to be emitted. However, if I change memcpy() to __builtin_memcpy(), the generated code is the same as on Linux. What’s special about Mac OS that causes memcpy() to behave in this strange way? This looks like a bug.
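The source behind the Compiler Explorer link is not reproduced in this thread, so the following is only a hypothetical scalar stand-in for that kind of function, not the original code. It copies 32 bytes into a local buffer, permutes them using the index pattern shown in the comment of the Linux listing, and copies the result out. The name shuffle32b_scalar is made up for illustration. It uses __builtin_memcpy() (a GCC/Clang builtin), the variant that produced the good code in the experiment described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar stand-in for the function in the question.
 * Within each 8-byte group, output positions 0..3 take the even-indexed
 * input bytes and positions 4..7 take the odd-indexed ones, i.e. the
 * pattern [0,2,4,6,1,3,5,7] from the vpshufb comment. */
void shuffle32b_scalar(unsigned char *dest, const unsigned char *src)
{
    unsigned char in[32], out[32];

    assert(dest != NULL && src != NULL);

    /* __builtin_memcpy bypasses any fortified memcpy macro from <string.h>. */
    __builtin_memcpy(in, src, sizeof in);

    for (size_t i = 0; i < 32; ++i) {
        size_t group = i & ~(size_t)7;   /* first index of the 8-byte group */
        size_t k = i & 7;                /* position inside the group */
        size_t from = group + (k < 4 ? 2 * k : 2 * (k - 4) + 1);
        out[i] = in[from];
    }

    __builtin_memcpy(dest, out, sizeof out);
}
```

Replacing the two __builtin_memcpy() calls with plain memcpy() (and including <string.h>) would give the form that the question is about. Whether the local buffers are then optimized away depends on the target's fortify defaults.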

Different -D_FORTIFY_SOURCE defaults: Compiler Explorer

The __memcpy_chk call seems to be eliminated quite late, after the final SROA pass that would otherwise eliminate the memcpys. A __memcpy_chk can only be turned into a normal memcpy intrinsic once its object size argument is known. For objects of unknown size (in this case, *dest’s size), that argument is only resolved late, by LowerConstantIntrinsicsPass, and it is the next InstCombine run after that which optimises the __memcpy_chk.
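As a rough sketch of what the compiler sees on the fortified path: with _FORTIFY_SOURCE enabled, the libc headers wrap memcpy() so that a call becomes roughly __builtin___memcpy_chk(dest, src, n, __builtin_object_size(dest, 0)). The macro FORTIFIED_MEMCPY and the two function names below are hypothetical and only approximate what a real header does. Both builtins are GCC/Clang builtins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro approximating a fortified memcpy() wrapper. */
#define FORTIFIED_MEMCPY(dest, src, n) \
    __builtin___memcpy_chk((dest), (src), (n), \
                           __builtin_object_size((dest), 0))

/* dest is a bare pointer parameter: the object size is not known here,
 * so the size argument stays symbolic until the compiler resolves it. */
void copy32_unknown_dest(unsigned char *dest, const unsigned char *src)
{
    assert(dest != NULL && src != NULL);
    FORTIFIED_MEMCPY(dest, src, 32);
}

/* dest is a local array: its size (32) is visible to the compiler. */
unsigned char first_byte_via_local(const unsigned char *src)
{
    unsigned char tmp[32];

    assert(src != NULL);
    FORTIFIED_MEMCPY(tmp, src, sizeof tmp);
    return tmp[0];
}
```

The first function corresponds to the store through dest in the question: the checked copy can only become a plain memcpy once the unknown object size has been lowered. Building with -D_FORTIFY_SOURCE=0, or calling __builtin_memcpy() directly, avoids the checked variant altogether.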

Yeah this is an unfortunate issue. Thanks for raising awareness of this again.

There are two potential improvements; I just need to get back to them.

⚙ D114401 [Passes] Run LowerConstantIntrinsics after SCCP/before DSE.
⚙ D115167 [DSE] Use precise loc for memset_chk writing to local objects.