AArch64 NEON: ldrb scheduled before st2 causes a compare failure

My LLVM issue: arm neon ldrb before st2 error lead to compare fail · Issue #64696 · llvm/llvm-project · GitHub

clang: Compiler Explorer

gcc: Compiler Explorer

code

#include <arm_neon.h>

extern void abort (void);

int test_vst2_lane_u8 (const uint8_t *data) {
  uint8x8x2_t vectors;
  for (int i = 0; i < 2; i++, data += 8) {
    vectors.val[i] = vld1_u8 (data);
  }

  uint8_t temp[2];
  vst2_lane_u8 (temp, vectors, 6);

// printf("temp[0]: %d\n", temp[0]);
// printf("temp[1]: %d\n", temp[1]);
//printf("vectors.val[0][6]: %d\n", vget_lane_u8(vectors.val[0], 6));
//printf("vectors.val[1][6]: %d\n", vget_lane_u8(vectors.val[1], 6));

  for (int i = 0; i < 2; i++) {
    if (temp[i] != vget_lane_u8 (vectors.val[i], 6))   /* error */
      return 1;
  }
  return 0;
}

int main (int argc, char **argv)
{

  uint64_t orig_data[8] = {
    0x1234567890abcdefULL, 0x13579bdf02468aceULL,
  };

  if (test_vst2_lane_u8 ((const uint8_t *)orig_data))
    abort ();

  return 0;
}
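
For reference, here is a minimal portable sketch of what the check in test_vst2_lane_u8 expects. It does not use the real intrinsic: model_vst2_lane_u8 and model_data are hypothetical names introduced only for illustration, and the byte layout assumes a little-endian target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for vst2_lane_u8 (not the real intrinsic):
   writes lane `lane` of v0 to out[0] and lane `lane` of v1 to out[1]. */
void model_vst2_lane_u8(uint8_t out[2], const uint8_t v0[8],
                        const uint8_t v1[8], int lane)
{
    assert(lane >= 0 && lane < 8);
    out[0] = v0[lane];
    out[1] = v1[lane];
}

/* Bytes of 0x1234567890abcdef and 0x13579bdf02468ace as they sit in memory
   on a little-endian target (assumption: little-endian AArch64). */
const uint8_t model_data[16] = {
    0xef, 0xcd, 0xab, 0x90, 0x78, 0x56, 0x34, 0x12,
    0xce, 0x8a, 0x46, 0x02, 0xdf, 0x9b, 0x57, 0x13,
};
```

Calling model_vst2_lane_u8(temp, model_data, model_data + 8, 6) should leave temp[0] equal to lane 6 of the first vector and temp[1] equal to lane 6 of the second, which is exactly what the loop in the test compares.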

This is the generated assembly:

test_vst2_lane_u8:                      // @test_vst2_lane_u8
        sub     sp, sp, #16
        ldp     d0, d1, [x0]
        add     x8, sp, #12
        ldrb    w11, [sp, #13]               // temp[1]
        st2     { v0.b, v1.b }[6], [x8]      // vst2_lane_u8 (temp, vectors, 6);
        umov    w8, v0.b[6]                 
        ldrb    w9, [sp, #12]                 // temp[0]
        umov    w10, v1.b[6]
        cmp     w9, w8, uxtb
        ccmp    w11, w10, #0, eq
        cset    w0, ne
        add     sp, sp, #16
        ret

The error occurs because temp[1] is loaded (ldrb w11, [sp, #13]) before the st2 writes it, so the comparison sees a stale value and fails.
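
To make the overlap explicit: x8 is sp+12, and the byte-lane st2 writes two bytes, so it covers sp+12 and sp+13, while the hoisted ldrb reads sp+13. A rough sketch of that interval check follows; ranges_overlap is a hypothetical helper written for this post, not an LLVM function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the byte ranges
   [a_off, a_off + a_size) and [b_off, b_off + b_size) share a byte. */
bool ranges_overlap(size_t a_off, size_t a_size, size_t b_off, size_t b_size)
{
    assert(a_size > 0 && b_size > 0);
    return a_off < b_off + b_size && b_off < a_off + a_size;
}
```

With offsets relative to sp, ranges_overlap(12, 2, 13, 1) is true, so the load of temp[1] depends on the store and should not be moved above it.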

If I move the ldrb of temp[1] after the st2, like this:

        st2     { v0.b, v1.b }[6], [x8]      // vst2_lane_u8 (temp, vectors, 6);
        umov    w8, v0.b[6]
        ldrb    w11, [sp, #13]               // temp[1]                 
        ldrb    w9, [sp, #12]                 // temp[0]

and recompile it, the test passes.

Looking further, here is the LLVM debug output from the post-MI-sched pass:

AArch64 Indirect Thunks
Before post-MI-sched:
# Machine code for function test_vst2_lane_u8: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=2, align=4, at location [SP-4]
Function Live Ins: $x0

bb.0.entry:
  liveins: $x0
  $sp = frame-setup SUBXri $sp, 16, 0
  frame-setup CFI_INSTRUCTION def_cfa_offset <mcsymbol .Ltmp0>16
  $x8 = ADDXri $sp, 12, 0
  renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
  renamable $w8 = UMOVvi8 renamable $q0, 6
  renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
  renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  dead $wzr = SUBSWrx killed renamable $w9, killed renamable $w8, 0, implicit-def $nzcv
  CCMPWr killed renamable $w11, killed renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  $sp = frame-destroy ADDXri $sp, 16, 0
  frame-destroy CFI_INSTRUCTION def_cfa_offset 0
  RET undef $lr, implicit $w0

# End machine code for function test_vst2_lane_u8.

********** MI Scheduling **********
test_vst2_lane_u8:%bb.0 entry
  From: $x8 = ADDXri $sp, 12, 0
    To: $sp = frame-destroy ADDXri $sp, 16, 0
 RegionInstrs: 10
ScheduleDAGMI::schedule starting
SU(0):   $x8 = ADDXri $sp, 12, 0
  # preds left       : 0
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 11
  Successors:
    SU(3): Out  Latency=1
    SU(2): Data Latency=3 Reg=$x8
SU(1):   renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  # preds left       : 0
  # succs left       : 5
  # rdefs left       : 0
  Latency            : 5
  Depth              : 0
  Height             : 13
  Successors:
    SU(3): Data Latency=4 Reg=$q0
    SU(5): Data Latency=0 Reg=$q0_q1
    SU(2): Data Latency=5 Reg=$q0_q1
    SU(5): Data Latency=5 Reg=$q1
    SU(9): Anti Latency=0
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 5
  Height             : 8
  Predecessors:
    SU(1): Data Latency=5 Reg=$q0_q1
    SU(0): Data Latency=3 Reg=$x8
  Successors:
    SU(3): Anti Latency=0
    SU(4): Ord  Latency=1 Memory
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
  # preds left       : 3
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 4
  Depth              : 5
  Height             : 8
  Predecessors:
    SU(2): Anti Latency=0
    SU(1): Data Latency=4 Reg=$q0
    SU(0): Out  Latency=1
  Successors:
    SU(7): Data Latency=4 Reg=$w8
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  # preds left       : 1
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 6
  Height             : 7
  Predecessors:
    SU(2): Ord  Latency=1 Memory
  Successors:
    SU(7): Data Latency=3 Reg=$w9
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
  # preds left       : 2
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 4
  Depth              : 5
  Height             : 7
  Predecessors:
    SU(1): Data Latency=0 Reg=$q0_q1
    SU(1): Data Latency=5 Reg=$q1
  Successors:
    SU(8): Data Latency=4 Reg=$w10
SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 0
  Height             : 6
  Successors:
    SU(8): Data Latency=3 Reg=$w11
SU(7):   dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
  # preds left       : 2
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 3
  Depth              : 9
  Height             : 4
  Predecessors:
    SU(4): Data Latency=3 Reg=$w9
    SU(3): Data Latency=4 Reg=$w8
  Successors:
    SU(8): Out  Latency=1
    SU(8): Data Latency=1 Reg=$nzcv
SU(8):   CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  # preds left       : 4
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 3
  Depth              : 10
  Height             : 3
  Predecessors:
    SU(7): Out  Latency=1
    SU(7): Data Latency=1 Reg=$nzcv
    SU(6): Data Latency=3 Reg=$w11
    SU(5): Data Latency=4 Reg=$w10
  Successors:
    SU(9): Data Latency=3 Reg=$nzcv
SU(9):   renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  # preds left       : 2
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 3
  Depth              : 13
  Height             : 0
  Predecessors:
    SU(8): Data Latency=3 Reg=$nzcv
    SU(1): Anti Latency=0
ExitSU:   $sp = frame-destroy ADDXri $sp, 16, 0
  # preds left       : 0
  # succs left       : 0
  # rdefs left       : 0
  Latency            : 0
  Depth              : 0
  Height             : 0
Disabled scoreboard hazard recognizer
Critical Path: (PGS-RR) 13
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 0 1 6 
  TopQ.A RemainingLatency 0 + 0c > CritPath 13
  Cand SU(0) ORDER                              
  Cand SU(1) TOP-PATH                  13 cycles 
Pick Top TOP-PATH  
Scheduling SU(1) renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
  Ready @0c
  CortexA55UnitLd +2x2u
  *** Critical resource CortexA55UnitLd: 2c
  TopQ.A BotLatency SU(1) 13c
  *** Max MOps 2 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 2
  Executed: 2c
  Critical: 2c, 2 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
  SU(6) CortexA55UnitLd[0]=2c
Queue TopQ.P: 5 6 
Queue TopQ.A: 0 
Pick Top ONLY1     
Scheduling SU(0) $x8 = ADDXri $sp, 12, 0
  Ready @1c
  CortexA55UnitALU +1x1u
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 2 TopQ.A
Queue TopQ.P: 5 2 
Queue TopQ.A: 6 
Pick Top ONLY1     
Scheduling SU(6) renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
  Ready @2c
  CortexA55UnitLd +1x2u
TopQ.A @2c
  Retired: 4
  Executed: 3c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 3 TopQ.A
Cycle: 5 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 5 2 
  TopQ.A RemainingLatency 0 + 5c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(2) TOP-PATH                  8 cycles 
Pick Top TOP-PATH  
Scheduling SU(2) ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
  Ready @5c
  CortexA55UnitSt +2x2u
  TopQ.A TopLatency SU(2) 5c
TopQ.A @5c
  Retired: 5
  Executed: 5c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 5c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 4 
Queue TopQ.A: 5 3 
  TopQ.A RemainingLatency 0 + 5c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(3) TOP-PATH                  8 cycles 
Pick Top TOP-PATH  
Scheduling SU(3) renamable $w8 = UMOVvi8 renamable $q0, 6
  Ready @5c
  CortexA55UnitFPALU +1x1u
  *** Max MOps 2 at cycle 5
Cycle: 6 TopQ.A
TopQ.A @6c
  Retired: 6
  Executed: 6c
  Critical: 3c, 3 CortexA55UnitLd
  ExpectedLatency: 5c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 5 4 
  TopQ.A RemainingLatency 0 + 6c > CritPath 13
  Latency limited both directions.
  Cand SU(5) ORDER                              
  Cand SU(4) ORDER                              
Pick Top ORDER     
Scheduling SU(4) renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  Ready @6c
  CortexA55UnitLd +1x2u
  TopQ.A TopLatency SU(4) 6c
TopQ.A @6c
  Retired: 7
  Executed: 6c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 6c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 7 
Queue TopQ.A: 5 
Pick Top ONLY1     
Scheduling SU(5) renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
  Ready @6c
  CortexA55UnitFPALU +1x1u
  *** Max MOps 2 at cycle 6
Cycle: 7 TopQ.A
TopQ.A @7c
  Retired: 8
  Executed: 7c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 6c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 9 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 7 
Pick Top ONLY1     
Scheduling SU(7) dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
  Ready @9c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(7) 9c
TopQ.A @9c
  Retired: 9
  Executed: 9c
  Critical: 4c, 4 CortexA55UnitLd
  ExpectedLatency: 9c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 10 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 8 
Pick Top ONLY1     
Scheduling SU(8) CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
  Ready @10c
  *** Critical resource NumMicroOps: 5c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(8) 10c
TopQ.A @10c
  Retired: 10
  Executed: 10c
  Critical: 5c, 10 MOps
  ExpectedLatency: 10c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 11 TopQ.A
Cycle: 13 TopQ.A
Queue TopQ.P: 
Queue TopQ.A: 9 
Pick Top ONLY1     
Scheduling SU(9) renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
  Ready @13c
  CortexA55UnitALU +1x1u
  TopQ.A TopLatency SU(9) 13c
TopQ.A @13c
  Retired: 11
  Executed: 13c
  Critical: 5c, 11 MOps
  ExpectedLatency: 13c
  - Latency limited.
** ScheduleDAGMI::schedule picking next node
*** Final schedule for %bb.0 ***
SU(1):   renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
SU(0):   $x8 = ADDXri $sp, 12, 0
SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
SU(7):   dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
SU(8):   CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
SU(9):   renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv

Fixup kills for %bb.0

The scheduler changes the instruction order from

  ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
  renamable $w8 = UMOVvi8 renamable $q0, 6
  renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
  renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
  renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)

to

SU(6):   renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2):   ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3):   renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4):   renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5):   renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1

so the load of temp[1], SU(6), now comes before the ST2 store, SU(2).
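
As a sanity check on the log, here is a small sketch that tests a schedule against a list of dependency edges. The edges are copied from the "Successors" lists in the dump; schedule_respects, dep_edge, and missing_edge are hypothetical names, and the SU(2) -> SU(6) edge is my assumption about what the DAG would need, since the log does not contain it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dep_edge { int pred; int succ; };

/* Returns true if, for every edge, the predecessor is placed before the
   successor in `order`. Assumes `order` is a permutation of 0..n-1. */
bool schedule_respects(const int *order, size_t n,
                       const struct dep_edge *edges, size_t n_edges)
{
    int pos[16];
    assert(n <= 16);
    for (size_t i = 0; i < n; i++) {
        assert(order[i] >= 0 && (size_t)order[i] < n);
        pos[order[i]] = (int)i;
    }
    for (size_t e = 0; e < n_edges; e++) {
        assert(edges[e].pred >= 0 && (size_t)edges[e].pred < n);
        assert(edges[e].succ >= 0 && (size_t)edges[e].succ < n);
        if (pos[edges[e].pred] >= pos[edges[e].succ])
            return false;
    }
    return true;
}

/* Final schedule from the log: SU(1), SU(0), SU(6), SU(2), SU(3), ... */
const int final_order[10] = {1, 0, 6, 2, 3, 4, 5, 7, 8, 9};

/* Edges taken from the "Successors" lists in the log (duplicates merged). */
const struct dep_edge logged_edges[] = {
    {0, 3}, {0, 2},
    {1, 3}, {1, 5}, {1, 2}, {1, 9},
    {2, 3}, {2, 4},
    {3, 7}, {4, 7}, {5, 8}, {6, 8}, {7, 8}, {8, 9},
};

/* Assumption: the memory ordering edge that is absent from the log. */
const struct dep_edge missing_edge = {2, 6};
```

The final schedule satisfies every edge the scheduler recorded, so the reordering is legal with respect to the DAG it was given. It only becomes illegal once the SU(2) -> SU(6) edge is added, which suggests the problem lies in how the dependency between the ST2 and the second load was (not) recorded rather than in the scheduling heuristic itself.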

I also compared this with the correct output (left is the failing version, right is the correct one).

Do you think the assessment in arm neon ldrb before st2 error lead to compare fail · Issue #64696 · llvm/llvm-project · GitHub is correct? If not, do you have some alternate theory?