Are the latencies of vextractf128 correct for Zen2/3 in MCA?

Dear LLVM community,

I have been trying to figure out something strange in MCA's output on Zen 2/3, and I am hoping you can help clarify it. In the snippet below, a combination of vextractf128 and vmovhlps moves the upper 128 bits of ymm0 to xmm1 and the upper 64 bits of xmm0 to the lower half of xmm2.
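
For reference, the timelines below come from an input roughly like this (the exact llvm-mca flags are reconstructed from memory, so treat the command as a sketch rather than the literal invocation):

# snippet.s
vmovapd      (%rdi), %ymm0
vsubpd       (%rsi), %ymm0, %ymm0
vmulpd       %ymm0, %ymm0, %ymm0
vextractf128 $1, %ymm0, %xmm1
vmovhlps     %xmm0, %xmm0, %xmm2

llvm-mca -mcpu=znver2 -timeline -iterations=1 snippet.s   # and -mcpu=znver3 for the second run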

Zen 2:
[0,0]     DeeeeeeeeER    . .   vmovapd  (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER . .   vsubpd   (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER.   vmulpd   %ymm0, %ymm0, %ymm0
[0,3]     D==============eER   vextractf128     $1, %ymm0, %xmm1
[0,4]     .D=============eER   vmovhlps %xmm0, %xmm0, %xmm2
Zen 3:
[0,0]     DeeeeeeeeER    .    .   vmovapd       (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER .    .   vsubpd        (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER   .   vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER   vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps      %xmm0, %xmm0, %xmm2

On Zen 2, MCA's output matches my expectations: since the extract and the vmovhl both read from ymm0/xmm0, they can both start executing as soon as the mul is done. According to MCA that is indeed what happens on both Zen 2 and Zen 3, but from here the two diverge.
On Zen 3, vextractf128 takes quite a bit longer to execute according to MCA, and that in turn keeps the vmovhl (which, being VEX-encoded, also zeroes the upper 128 bits of ymm2) from retiring until the extract has.

My question is this: why does the extract take longer to execute on Zen 3? According to Agner Fog's tables, the reg-reg version of vextractf128 (the one used here) has a latency of 3 clocks and a reciprocal throughput of 1 on both Zen 2 and Zen 3. Is there a port-conflict difference or something else I am not seeing here, or are the latencies in MCA (or in Agner Fog's tables) incorrect?

For vextractf128, the scheduling model definitions are different between znver2 and znver3. For znver2 we have the following:

// VEXTRACTF128 / VEXTRACTI128.
// x,y,i.
def : InstRW<[Zn2WriteFPU013], (instrs VEXTRACTF128rri,
                                       VEXTRACTI128rri)>;

// m128,y,i.
def : InstRW<[Zn2WriteFPU013m], (instrs VEXTRACTF128mri,
                                        VEXTRACTI128mri)>;

For znver3 we end up with the following:

def Zn3WriteVEXTRACTF128rr_VEXTRACTI128rr : SchedWriteRes<[Zn3FPFMisc0]> {
  let Latency = 4;
  let ReleaseAtCycles = [1];
  let NumMicroOps = 1;
}
def : InstRW<[Zn3WriteVEXTRACTF128rr_VEXTRACTI128rr], (instrs VEXTRACTF128rri, VEXTRACTI128rri)>;

def Zn3WriteVEXTRACTI128mr : SchedWriteRes<[Zn3FPFMisc0, Zn3FPSt, Zn3Store]> {
  let Latency = !add(Znver3Model.LoadLatency, Zn3WriteVEXTRACTF128rr_VEXTRACTI128rr.Latency);
  let ReleaseAtCycles = [1, 1, 1];
  let NumMicroOps = !add(Zn3WriteVEXTRACTF128rr_VEXTRACTI128rr.NumMicroOps, 1);
}
def : InstRW<[Zn3WriteVEXTRACTI128mr], (instrs VEXTRACTI128mri, VEXTRACTF128mri)>;

I can't say just by looking at them which one is right (or whether both are, or neither is). llvm-exegesis might be able to help a little, but it currently cannot validate ReleaseAtCycles fields, so some manual benchmarking would probably be needed here (which llvm-exegesis can also help with).
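
For the reg-reg form, the kind of invocation I have in mind is roughly this (a sketch; run on each machine with the matching -mcpu):

llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver2
llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3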

I think the relevant SchedWriteRes are:

//=== Integer MMX and XMM Instructions ===//

def Zn2WriteFPU013 : SchedWriteRes<[Zn2FPU013]> ;
def Zn2WriteFPU013m : SchedWriteRes<[Zn2AGU, Zn2FPU013]> {
  let Latency = 8;
  let NumMicroOps = 2;
}

MCA follows what's provided by the scheduling model, so I guess the question is whether znver2's scheduling model has the correct latency for these two instructions. (If I'm reading TargetSchedule.td right, Zn2WriteFPU013 does not override Latency, so it falls back to the SchedWriteRes default of 1 cycle, which is what the single-cycle extract in the Zen 2 timeline above shows, while the znver3 definition explicitly sets 4.)

Thanks for the replies! I have easy access to both Zen 2 and Zen 3 HW, but I have never used llvm-exegesis or tried to benchmark such things.

OK, so I have some test results, but I am not quite sure how to interpret them.
Trying to benchmark just the VEXTRACTF128rri instruction with llvm-exegesis' default latency measurement mode suggests that Zen 3 has regressed to about 3.5 cycles (?) of latency from Zen 2's 3 cycles.

# AMD EPYC 7402P 24-Core Processor
./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver2 --benchmark-repeat-count=100000
---
mode:            latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM5 YMM5 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM5=0x0'
cpu_name:        znver2
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 3.0141, per_snippet_value: 3.0141, validation_counters: {} }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F2C244883C420C4E37D19ED01C4E37D19ED01C4E37D19ED01C4E37D19ED01C3
...
# AMD EPYC 7443P 24-Core Processor
./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000
---
mode:            latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM7 YMM7 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM7=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 3.4476, per_snippet_value: 3.4476, validation_counters: {} }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F3C244883C420C4E37D19FF01C4E37D19FF01C4E37D19FF01C4E37D19FF01C3
...

But if I use the loop repetition mode, both Zen 2 and Zen 3 stay at about 3 cycles:

./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000 --repetition-mode=loop
---
mode:            latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM5 YMM5 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM5=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 3.0189, per_snippet_value: 3.0189, validation_counters: {} }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F2C244883C42049B80200000000000000662E0F1F840000000000C4E37D19ED01C4E37D19ED014983C0FF75EEC3
...

A single VEXTRACTF128rri instruction is 6 bytes long, so 10000 instructions take around 58 KiB (10000 × 6 B = 60000 B, about 58.6 KiB). A quick search showed that both the EPYC 7443P and the 7402P have only 32 KB of L1 I$, so I don't think it's a fair comparison given the noise caused by I$ misses. Maybe that's also why using loop as the repetition mode gives a more stable result.

You should be able to explicitly test the ICache-pressure hypothesis by using validation counters; the flag is --validation-counter=l1i-cache-load-misses. No idea whether it works on modern AMD CPUs, though.
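
Roughly like this, tacked onto the loop run (a sketch; I have not tried it on a Zen machine myself):

./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --repetition-mode=loop --validation-counter=l1i-cache-load-misses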

I have retested with -min-instructions=1000 and the results are between 3.07 and 3.13 for both the default and the loop mode on Zen 2, quite repeatably. On Zen 3 I get 3.48 to 3.57 with the default repetition mode and 3.07 to 3.12 with the loop mode.

So all four tests are quite reproducible for me; the only odd one out is Zen 3 with the default repetition mode, and changing -min-instructions from the default 10k to 1k has very little effect.

Well, I tried that, and a lot more cache misses are reported on Zen 2, with both repetition modes. Despite that, as you can see above, there is no difference in the measured latency between Zen 2 and Zen 3 when the loop mode is used.

PS: And if anything, more I$ misses would surely increase the apparent latency on Zen 2, not decrease it, right?