Dear LLVM community,
I have been trying to make sense of some odd MCA output on Zen 2/3, and I am hoping you can help clarify it. The snippet below uses vextractf128 and vmovhlps to move the upper 128 bits of ymm0 into xmm1 and the upper 64 bits of xmm0 into the lower half of xmm2.
Zen 2:
[0,0] DeeeeeeeeER . . vmovapd (%rdi), %ymm0
[0,1] D=eeeeeeeeeeER . . vsubpd (%rsi), %ymm0, %ymm0
[0,2] D===========eeeER. vmulpd %ymm0, %ymm0, %ymm0
[0,3] D==============eER vextractf128 $1, %ymm0, %xmm1
[0,4] .D=============eER vmovhlps %xmm0, %xmm0, %xmm2
Zen 3:
[0,0] DeeeeeeeeER . . vmovapd (%rdi), %ymm0
[0,1] D=eeeeeeeeeeER . . vsubpd (%rsi), %ymm0, %ymm0
[0,2] D===========eeeER . vmulpd %ymm0, %ymm0, %ymm0
[0,3] D==============eeeeER vextractf128 $1, %ymm0, %xmm1
[0,4] D==============eE---R vmovhlps %xmm0, %xmm0, %xmm2
On Zen 2, MCA’s output matches my expectations: since vextractf128 and vmovhlps both read from ymm0/xmm0, they can start executing as soon as vmulpd is done. According to MCA they start at that point on Zen 3 as well, but from there the two models diverge.
On Zen 3, vextractf128 spends four cycles executing instead of one according to MCA. That in turn keeps vmovhlps from retiring (the "-" cycles in its timeline row), since instructions retire in program order and it has to wait for the older vextractf128.
My question is this: why does the extract take longer to execute on Zen 3? According to Agner Fog’s tables, the reg-reg version of vextractf128 (the one used here) has a 3-cycle latency and a reciprocal throughput of 1 on both Zen 2 and Zen 3. Is there a port conflict or something else I am not seeing here, or are the latencies in MCA (or in Agner Fog’s tables) incorrect?
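
To make the comparison concrete, here is a small Python sketch that counts the per-instruction cycles straight from the timeline rows above. It is only an illustration: the `summarize` helper and the `TIMELINES` table are names I made up, the rows are copied from this post with the column padding collapsed, and it relies on the timeline legend as I understand it ("=" is dispatched and waiting to execute, "e" is executing, "-" is executed and waiting to retire).

```python
import re

# Timeline rows copied from the post (column padding collapsed).
# Only the two instructions under discussion are included.
TIMELINES = {
    "Zen 2": [
        "[0,3] D==============eER vextractf128 $1, %ymm0, %xmm1",
        "[0,4] .D=============eER vmovhlps %xmm0, %xmm0, %xmm2",
    ],
    "Zen 3": [
        "[0,3] D==============eeeeER vextractf128 $1, %ymm0, %xmm1",
        "[0,4] D==============eE---R vmovhlps %xmm0, %xmm0, %xmm2",
    ],
}

# [iteration,index]  <timeline characters>  <mnemonic>  <operands>
ROW = re.compile(r"^\[(\d+),(\d+)\]\s+([D=eER.\- ]+?)\s+([a-z]\w*)\s+(.*)$")


def summarize(row):
    """Count the waiting, executing and retire-wait cycles in one timeline row."""
    match = ROW.match(row)
    if match is None:
        raise ValueError(f"not a timeline row: {row!r}")
    timeline, mnemonic = match.group(3), match.group(4)
    return {
        "mnemonic": mnemonic,
        "waiting": timeline.count("="),       # dispatched, not yet executing
        "executing": timeline.count("e"),     # cycles spent executing
        "retire_wait": timeline.count("-"),   # executed, not yet retired
    }


if __name__ == "__main__":
    for cpu, rows in TIMELINES.items():
        for row in rows:
            print(cpu, summarize(row))
```

For the rows above this gives 1 executing cycle for vextractf128 on Zen 2 versus 4 on Zen 3, and 3 retire-wait cycles for vmovhlps on Zen 3 only, which is the discrepancy I am asking about.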