memcpy() not completely optimized away on Mac OS

Hi,

Here is an example (Compiler Explorer) of memcpy() not getting optimized away completely.

Targeting Linux results in the following assembly, which is good:

shuffle32b(unsigned char*, unsigned char const*):
        vmovdqu (%rsi), %ymm0
        vpshufb .LCPI0_0(%rip), %ymm0, %ymm0    # ymm0 = ymm0[0,2,4,6,1,3,5,7,8,10,12,14,9,11,13,15,16,18,20,22,17,19,21,23,24,26,28,30,25,27,29,31]
        vmovdqu %ymm0, (%rdi)
        vzeroupper
        retq

However, when targeting Mac OS, I get the following assembly output:

_shuffle32b:
       0:	55 	pushq	%rbp
       1:	48 89 e5 	movq	%rsp, %rbp
       4:	48 83 e4 e0 	andq	$-32, %rsp
       8:	48 83 ec 40 	subq	$64, %rsp
       c:	c5 fe 6f 06 	vmovdqu	(%rsi), %ymm0
      10:	c5 fd 7f 04 24 	vmovdqa	%ymm0, (%rsp)
      15:	c4 e2 7d 00 05 22 00 00 00 	vpshufb	34(%rip), %ymm0, %ymm0
      1e:	c5 fd 7f 04 24 	vmovdqa	%ymm0, (%rsp)
      23:	c5 fe 7f 07 	vmovdqu	%ymm0, (%rdi)
      27:	48 89 ec 	movq	%rbp, %rsp
      2a:	5d 	popq	%rbp
      2b:	c5 f8 77 	vzeroupper
      2e:	c3 	retq

At offsets 10 and 1e, %ymm0 is spilled onto the stack for no apparent reason, which also causes all the extra instructions for aligning the stack frame to be emitted. However, if I change memcpy() to __builtin_memcpy(), the generated code is the same as on Linux. What’s special about Mac OS that causes memcpy() to behave in this strange way? This looks like a bug.
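The actual source is only behind the Compiler Explorer link, so here is a rough scalar sketch of the same pattern rather than the original code. It assumes the function copies the input into a local buffer with memcpy(), permutes the bytes in the order shown in the vpshufb comment above, and copies the result out with memcpy(). The name shuffle32b_sketch and the scalar loop are made up for illustration; the real function presumably uses AVX2 intrinsics.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar stand-in for shuffle32b; not the original code.
   Within each group of 8 bytes, the even-indexed bytes come first,
   followed by the odd-indexed bytes, matching the vpshufb mask above. */
void shuffle32b_sketch(unsigned char *dest, const unsigned char *src) {
    unsigned char in[32], out[32];
    assert(dest != NULL && src != NULL);

    memcpy(in, src, sizeof in);   /* load through a local buffer */
    for (int g = 0; g < 32; g += 8) {
        for (int i = 0; i < 4; i++) {
            out[g + i]     = in[g + 2 * i];      /* bytes 0,2,4,6 of the group */
            out[g + 4 + i] = in[g + 2 * i + 1];  /* bytes 1,3,5,7 of the group */
        }
    }
    memcpy(dest, out, sizeof out); /* store through a local buffer */
}
```

Replacing both memcpy() calls with __builtin_memcpy() is the change described above that made the Mac OS output match the Linux output.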

The two targets have different -D_FORTIFY_SOURCE defaults: Compiler Explorer

The __memcpy_chk call seems to be eliminated quite late, after the final SROA pass that would otherwise eliminate the memcpys. A __memcpy_chk can only be turned into a normal memcpy intrinsic once its object size argument is known. For objects of unknown size (in this case, the size of *dest), resolving that argument is left until late and is done by LowerConstantIntrinsicsPass; only the next InstCombine run after that optimises the __memcpy_chk.
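To make the mechanism concrete, here is a minimal sketch of what a fortified memcpy() roughly amounts to. The fortified_memcpy macro and the copy32 function are hypothetical names for illustration; real headers differ in detail. The two compiler builtins used, __builtin___memcpy_chk and __builtin_object_size, are provided by both GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macro, for illustration only: roughly how a fortified
   header can redirect memcpy() to the checked builtin, passing the
   destination's object size as an extra argument. */
#define fortified_memcpy(dest, src, n) \
    __builtin___memcpy_chk((dest), (src), (n), \
                           __builtin_object_size((dest), 0))

void copy32(unsigned char *dest, const unsigned char *src) {
    unsigned char tmp[32];
    assert(dest != NULL && src != NULL);

    /* tmp is a local array, so its object size is visible to the
       compiler and the check can be resolved. */
    fortified_memcpy(tmp, src, sizeof tmp);

    /* The object behind dest is not visible inside this function, so
       its size stays unresolved until the compiler gives up and uses
       the "unknown" value, (size_t)-1. */
    fortified_memcpy(dest, tmp, sizeof tmp);
}
```

A plain __builtin_memcpy() call carries no object size argument at all, which would be consistent with it producing the same code as on Linux.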

Yeah this is an unfortunate issue. Thanks for raising awareness of this again.

There are two potential improvements; I just need to get back to them.

⚙ D114401 [Passes] Run LowerConstantIntrinsics after SCCP/before DSE.
⚙ D115167 [DSE] Use precise loc for memset_chk writing to local objects.