My llvm issue: arm neon ldrb before st2 error lead to compare fail · Issue #64696 · llvm/llvm-project · GitHub
clang: Compiler Explorer
gcc: Compiler Explorer
code
#include <arm_neon.h>
extern void abort (void);
int test_vst2_lane_u8 (const uint8_t *data) {
uint8x8x2_t vectors;
for (int i = 0; i < 2; i++, data += 8) {
vectors.val[i] = vld1_u8 (data);
}
uint8_t temp[2];
vst2_lane_u8 (temp, vectors, 6);
// printf("temp[0]: %d\n", temp[0]);
// printf("temp[1]: %d\n", temp[1]);
//printf("vectors.val[0][6]: %d\n", vget_lane_u8(vectors.val[0], 6));
//printf("vectors.val[1][6]: %d\n", vget_lane_u8(vectors.val[1], 6));
for (int i = 0; i < 2; i++) {
if (temp[i] != vget_lane_u8 (vectors.val[i], 6)) /* error */
return 1;
}
return 0;
}
int main (int argc, char **argv)
{
uint64_t orig_data[8] = {
0x1234567890abcdefULL, 0x13579bdf02468aceULL,
};
if (test_vst2_lane_u8 ((const uint8_t *)orig_data))
abort ();;
return 0;
}
I see the asm
test_vst2_lane_u8: // @test_vst2_lane_u8
sub sp, sp, #16
ldp d0, d1, [x0]
add x8, sp, #12
ldrb w11, [sp, #13] // temp[1]
st2 { v0.b, v1.b }[6], [x8] // vst2_lane_u8 (temp, vectors, 6);
umov w8, v0.b[6]
ldrb w9, [sp, #12] // temp[0]
umov w10, v1.b[6]
cmp w9, w8, uxtb
ccmp w11, w10, #0, eq
cset w0, ne
add sp, sp, #16
ret
The error is caused because the temp[1]
is loaded before st2
, so the compare fail.
I switch it like :
st2 { v0.b, v1.b }[6], [x8] // vst2_lane_u8 (temp, vectors, 6);
umov w8, v0.b[6]
ldrb w11, [sp, #13] // temp[1]
ldrb w9, [sp, #12] // temp[0]
and recompile it, it’s OK.
Ahead, see the llvm debug info
AArch64 Indirect Thunks
Before post-MI-sched:
# Machine code for function test_vst2_lane_u8: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
fi#0: size=2, align=4, at location [SP-4]
Function Live Ins: $x0
bb.0.entry:
liveins: $x0
$sp = frame-setup SUBXri $sp, 16, 0
frame-setup CFI_INSTRUCTION def_cfa_offset <mcsymbol .Ltmp0>16
$x8 = ADDXri $sp, 12, 0
renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
renamable $w8 = UMOVvi8 renamable $q0, 6
renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
dead $wzr = SUBSWrx killed renamable $w9, killed renamable $w8, 0, implicit-def $nzcv
CCMPWr killed renamable $w11, killed renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
$sp = frame-destroy ADDXri $sp, 16, 0
frame-destroy CFI_INSTRUCTION def_cfa_offset 0
RET undef $lr, implicit $w0
# End machine code for function test_vst2_lane_u8.
********** MI Scheduling **********
test_vst2_lane_u8:%bb.0 entry
From: $x8 = ADDXri $sp, 12, 0
To: $sp = frame-destroy ADDXri $sp, 16, 0
RegionInstrs: 10
ScheduleDAGMI::schedule starting
SU(0): $x8 = ADDXri $sp, 12, 0
# preds left : 0
# succs left : 2
# rdefs left : 0
Latency : 3
Depth : 0
Height : 11
Successors:
SU(3): Out Latency=1
SU(2): Data Latency=3 Reg=$x8
SU(1): renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
# preds left : 0
# succs left : 5
# rdefs left : 0
Latency : 5
Depth : 0
Height : 13
Successors:
SU(3): Data Latency=4 Reg=$q0
SU(5): Data Latency=0 Reg=$q0_q1
SU(2): Data Latency=5 Reg=$q0_q1
SU(5): Data Latency=5 Reg=$q1
SU(9): Anti Latency=0
SU(2): ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
# preds left : 2
# succs left : 2
# rdefs left : 0
Latency : 5
Depth : 5
Height : 8
Predecessors:
SU(1): Data Latency=5 Reg=$q0_q1
SU(0): Data Latency=3 Reg=$x8
Successors:
SU(3): Anti Latency=0
SU(4): Ord Latency=1 Memory
SU(3): renamable $w8 = UMOVvi8 renamable $q0, 6
# preds left : 3
# succs left : 1
# rdefs left : 0
Latency : 4
Depth : 5
Height : 8
Predecessors:
SU(2): Anti Latency=0
SU(1): Data Latency=4 Reg=$q0
SU(0): Out Latency=1
Successors:
SU(7): Data Latency=4 Reg=$w8
SU(4): renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
# preds left : 1
# succs left : 1
# rdefs left : 0
Latency : 3
Depth : 6
Height : 7
Predecessors:
SU(2): Ord Latency=1 Memory
Successors:
SU(7): Data Latency=3 Reg=$w9
SU(5): renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
# preds left : 2
# succs left : 1
# rdefs left : 0
Latency : 4
Depth : 5
Height : 7
Predecessors:
SU(1): Data Latency=0 Reg=$q0_q1
SU(1): Data Latency=5 Reg=$q1
Successors:
SU(8): Data Latency=4 Reg=$w10
SU(6): renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
# preds left : 0
# succs left : 1
# rdefs left : 0
Latency : 3
Depth : 0
Height : 6
Successors:
SU(8): Data Latency=3 Reg=$w11
SU(7): dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
# preds left : 2
# succs left : 2
# rdefs left : 0
Latency : 3
Depth : 9
Height : 4
Predecessors:
SU(4): Data Latency=3 Reg=$w9
SU(3): Data Latency=4 Reg=$w8
Successors:
SU(8): Out Latency=1
SU(8): Data Latency=1 Reg=$nzcv
SU(8): CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
# preds left : 4
# succs left : 1
# rdefs left : 0
Latency : 3
Depth : 10
Height : 3
Predecessors:
SU(7): Out Latency=1
SU(7): Data Latency=1 Reg=$nzcv
SU(6): Data Latency=3 Reg=$w11
SU(5): Data Latency=4 Reg=$w10
Successors:
SU(9): Data Latency=3 Reg=$nzcv
SU(9): renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
# preds left : 2
# succs left : 0
# rdefs left : 0
Latency : 3
Depth : 13
Height : 0
Predecessors:
SU(8): Data Latency=3 Reg=$nzcv
SU(1): Anti Latency=0
ExitSU: $sp = frame-destroy ADDXri $sp, 16, 0
# preds left : 0
# succs left : 0
# rdefs left : 0
Latency : 0
Depth : 0
Height : 0
Disabled scoreboard hazard recognizer
Critical Path: (PGS-RR) 13
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 0 1 6
TopQ.A RemainingLatency 0 + 0c > CritPath 13
Cand SU(0) ORDER
Cand SU(1) TOP-PATH 13 cycles
Pick Top TOP-PATH
Scheduling SU(1) renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
Ready @0c
CortexA55UnitLd +2x2u
*** Critical resource CortexA55UnitLd: 2c
TopQ.A BotLatency SU(1) 13c
*** Max MOps 2 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
Retired: 2
Executed: 2c
Critical: 2c, 2 CortexA55UnitLd
ExpectedLatency: 0c
- Resource limited.
** ScheduleDAGMI::schedule picking next node
SU(6) CortexA55UnitLd[0]=2c
Queue TopQ.P: 5 6
Queue TopQ.A: 0
Pick Top ONLY1
Scheduling SU(0) $x8 = ADDXri $sp, 12, 0
Ready @1c
CortexA55UnitALU +1x1u
TopQ.A @1c
Retired: 3
Executed: 2c
Critical: 2c, 2 CortexA55UnitLd
ExpectedLatency: 0c
- Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 2 TopQ.A
Queue TopQ.P: 5 2
Queue TopQ.A: 6
Pick Top ONLY1
Scheduling SU(6) renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
Ready @2c
CortexA55UnitLd +1x2u
TopQ.A @2c
Retired: 4
Executed: 3c
Critical: 3c, 3 CortexA55UnitLd
ExpectedLatency: 0c
- Resource limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 3 TopQ.A
Cycle: 5 TopQ.A
Queue TopQ.P:
Queue TopQ.A: 5 2
TopQ.A RemainingLatency 0 + 5c > CritPath 13
Latency limited both directions.
Cand SU(5) ORDER
Cand SU(2) TOP-PATH 8 cycles
Pick Top TOP-PATH
Scheduling SU(2) ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
Ready @5c
CortexA55UnitSt +2x2u
TopQ.A TopLatency SU(2) 5c
TopQ.A @5c
Retired: 5
Executed: 5c
Critical: 3c, 3 CortexA55UnitLd
ExpectedLatency: 5c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 4
Queue TopQ.A: 5 3
TopQ.A RemainingLatency 0 + 5c > CritPath 13
Latency limited both directions.
Cand SU(5) ORDER
Cand SU(3) TOP-PATH 8 cycles
Pick Top TOP-PATH
Scheduling SU(3) renamable $w8 = UMOVvi8 renamable $q0, 6
Ready @5c
CortexA55UnitFPALU +1x1u
*** Max MOps 2 at cycle 5
Cycle: 6 TopQ.A
TopQ.A @6c
Retired: 6
Executed: 6c
Critical: 3c, 3 CortexA55UnitLd
ExpectedLatency: 5c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 5 4
TopQ.A RemainingLatency 0 + 6c > CritPath 13
Latency limited both directions.
Cand SU(5) ORDER
Cand SU(4) ORDER
Pick Top ORDER
Scheduling SU(4) renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
Ready @6c
CortexA55UnitLd +1x2u
TopQ.A TopLatency SU(4) 6c
TopQ.A @6c
Retired: 7
Executed: 6c
Critical: 4c, 4 CortexA55UnitLd
ExpectedLatency: 6c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 7
Queue TopQ.A: 5
Pick Top ONLY1
Scheduling SU(5) renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
Ready @6c
CortexA55UnitFPALU +1x1u
*** Max MOps 2 at cycle 6
Cycle: 7 TopQ.A
TopQ.A @7c
Retired: 8
Executed: 7c
Critical: 4c, 4 CortexA55UnitLd
ExpectedLatency: 6c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 9 TopQ.A
Queue TopQ.P:
Queue TopQ.A: 7
Pick Top ONLY1
Scheduling SU(7) dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
Ready @9c
CortexA55UnitALU +1x1u
TopQ.A TopLatency SU(7) 9c
TopQ.A @9c
Retired: 9
Executed: 9c
Critical: 4c, 4 CortexA55UnitLd
ExpectedLatency: 9c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 10 TopQ.A
Queue TopQ.P:
Queue TopQ.A: 8
Pick Top ONLY1
Scheduling SU(8) CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
Ready @10c
*** Critical resource NumMicroOps: 5c
CortexA55UnitALU +1x1u
TopQ.A TopLatency SU(8) 10c
TopQ.A @10c
Retired: 10
Executed: 10c
Critical: 5c, 10 MOps
ExpectedLatency: 10c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
Cycle: 11 TopQ.A
Cycle: 13 TopQ.A
Queue TopQ.P:
Queue TopQ.A: 9
Pick Top ONLY1
Scheduling SU(9) renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
Ready @13c
CortexA55UnitALU +1x1u
TopQ.A TopLatency SU(9) 13c
TopQ.A @13c
Retired: 11
Executed: 13c
Critical: 5c, 11 MOps
ExpectedLatency: 13c
- Latency limited.
** ScheduleDAGMI::schedule picking next node
*** Final schedule for %bb.0 ***
SU(1): renamable $d0, renamable $d1 = LDPDi renamable $x0, 0 :: (load (s64) from %ir.data, align 1), (load (s64) from %ir.vectors.sroa.5.0.data.sroa_idx, align 1)
SU(0): $x8 = ADDXri $sp, 12, 0
SU(6): renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2): ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3): renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4): renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5): renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
SU(7): dead $wzr = SUBSWrx renamable $w9, renamable $w8, 0, implicit-def $nzcv
SU(8): CCMPWr renamable $w11, renamable $w10, 0, 0, implicit-def $nzcv, implicit $nzcv
SU(9): renamable $w0 = CSINCWr $wzr, $wzr, 0, implicit $nzcv
Fixup kills for %bb.0
gmi change from
ST2i8 renamable $q0_q1, 6, killed renamable $x8 :: (store (s128) into %ir.temp)
renamable $w8 = UMOVvi8 renamable $q0, 6
renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
renamable $w10 = UMOVvi8 renamable $q1, 6, implicit killed $q0_q1
renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
to
SU(6): renamable $w11 = LDRBBui $sp, 13 :: (dereferenceable load (s8) from %ir.arrayidx11.1)
SU(2): ST2i8 renamable $q0_q1, 6, renamable $x8 :: (store (s128) into %ir.temp)
SU(3): renamable $w8 = UMOVvi8 renamable $q0, 6
SU(4): renamable $w9 = LDRBBui $sp, 12 :: (dereferenceable load (s8) from %ir.temp, align 4, !tbaa !6)
SU(5): renamable $w10 = UMOVvi8 renamable $q1, 6, implicit $q0_q1
in the code
And I compare to the right ed(left is error, right is right)