- Test case (Compiler Explorer):
```c
void testWhileWR(int *data1, int *data2, int size) {
// #pragma unroll 2
#pragma GCC unroll 2
  for (int i = 0; i < size; i++) {
    data2[i] = i;
  }
}
```
- Assembly output from LLVM:
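```asm
.LBB0_2:                                  // =>This Inner Loop Header: Depth=1
        mov     x11, x8                   // 1st loop body
        st1w    { z1.s }, p0, [x1, x8, lsl #2]
        incw    x11                       // x11 = x8 + number of 32-bit lanes
        whilelo p1.s, x11, x9             // predicate for the 2nd vector
        b.pl    .LBB0_4                   // exit if its first lane is inactive
        add     z1.s, z1.s, z0.s          // start 2nd loop body
        st1w    { z1.s }, p1, [x10, x8, lsl #2]
        inch    x8                        // x8 += 2 * number of 32-bit lanes
        add     z1.s, z1.s, z0.s
        whilelo p0.s, x8, x9              // predicate for the next 1st vector
        b.mi    .LBB0_2                   // loop while its first lane is active
```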
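To make the predicated control flow of this listing concrete, here is a scalar C model of it. It is a sketch under assumptions that are not visible in the listing: the vector length is fixed at a hypothetical 4 lanes (`VL`), `p0` comes from a `whilelo` executed before the loop, `z0` holds `VL` in every lane, and `x10` points one vector past `data2`. The helper names (`whilelo`, `st1w`, `add_vl`, `model_unrolled`, `check_model`) are hypothetical and only mimic the instructions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical vector length in 32-bit lanes. Real SVE hardware fixes it per
 * implementation (128 to 2048 bits); 4 lanes models 128-bit vectors. */
#define VL 4

/* WHILELO Pd.S, base, limit: lane k is active iff base + k < limit. */
static void whilelo(bool pred[VL], long base, long limit) {
    for (int k = 0; k < VL; k++)
        pred[k] = base + k < limit;
}

/* ST1W under a governing predicate: only active lanes are written. */
static void st1w(const int vec[VL], const bool pred[VL], int *dst) {
    for (int k = 0; k < VL; k++)
        if (pred[k])
            dst[k] = vec[k];
}

/* z1 += z0, assuming z0 holds VL in every lane (set up before the loop). */
static void add_vl(int vec[VL]) {
    for (int k = 0; k < VL; k++)
        vec[k] += VL;
}

/* Scalar model of the 2x-unrolled loop in the listing. */
void model_unrolled(int *data2, int size) {
    bool p0[VL], p1[VL];
    int z1[VL];
    for (int k = 0; k < VL; k++)
        z1[k] = k;                          /* z1 = {0, 1, ..., VL-1} */
    long x8 = 0;                            /* element index */
    whilelo(p0, x8, size);                  /* assumed pre-loop whilelo */
    while (p0[0]) {
        st1w(z1, p0, data2 + x8);           /* 1st body */
        whilelo(p1, x8 + VL, size);         /* mov + incw x11, whilelo p1 */
        if (!p1[0])                         /* b.pl: first lane inactive, exit */
            break;
        add_vl(z1);                         /* 2nd body starts */
        st1w(z1, p1, data2 + x8 + VL);      /* x10 assumed = data2 + VL */
        x8 += 2 * VL;                       /* inch x8 */
        add_vl(z1);
        whilelo(p0, x8, size);              /* whilelo p0, then b.mi loops */
    }
}

/* Compare the model with the source loop (data2[i] = i) for one size. */
void check_model(int size) {
    enum { N = 64 };
    int got[N], want[N];
    assert(size <= N);
    for (int i = 0; i < N; i++)
        got[i] = want[i] = -1;
    for (int i = 0; i < size; i++)
        want[i] = i;
    model_unrolled(got, size);
    for (int i = 0; i < N; i++)
        assert(got[i] == want[i]);          /* also catches writes past size */
}
```

Calling `check_model(n)` for sizes up to 64 checks that the model writes exactly `data2[i] = i` for `i < n` and leaves the rest untouched. The `b.pl` exit skips the second body entirely once the first vector has already reached `size`.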
LLVM generates two similar loop bodies for the unroll-by-2 pragma, while GCC does not unroll the loop when the `+sve` option is enabled (it does unroll it without `+sve`). Because SVE already makes full use of the vector register lanes, there doesn't seem to be any performance gain from unrolling.
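One way to look at this: in the listing, each loop body is still followed by its own `whilelo` and conditional branch (`b.pl` after the first body, `b.mi` after the second), so unrolling does not reduce the predicate and branch work per vector. The sketch below counts those instructions for the 2x-unrolled shape and for a rolled loop. The rolled shape (`st1w; incw; add; whilelo; b.mi`) is an assumption, since its assembly is not shown here, and `VL`, `Counts`, and the function names are hypothetical.

```c
#include <assert.h>

#define VL 4  /* hypothetical: 4 x 32-bit lanes = 128-bit SVE vectors */

typedef struct {
    int stores;    /* st1w executed */
    int whilelos;  /* whilelo executed, including the one before the loop */
    int branches;  /* conditional branches executed inside the loop */
} Counts;

/* The first lane of WHILELO base, limit is active iff base < limit. */
static int first_active(long base, long limit) {
    return base < limit;
}

/* Assumed rolled shape: st1w; incw; add; whilelo; b.mi. */
Counts count_rolled(int size) {
    Counts c = {0, 1, 0};
    long i = 0;
    int p = first_active(i, size);           /* whilelo before the loop */
    while (p) {
        c.stores++;                           /* st1w */
        i += VL;                              /* incw */
        p = first_active(i, size);            /* whilelo */
        c.whilelos++;
        c.branches++;                         /* b.mi */
    }
    return c;
}

/* Shape of the 2x-unrolled loop in the LLVM listing. */
Counts count_unrolled2(int size) {
    Counts c = {0, 1, 0};
    long i = 0;
    int p0 = first_active(i, size);          /* assumed pre-loop whilelo */
    while (p0) {
        c.stores++;                           /* 1st st1w */
        int p1 = first_active(i + VL, size);  /* incw + whilelo p1 */
        c.whilelos++;
        c.branches++;                         /* b.pl */
        if (!p1)
            break;
        c.stores++;                           /* 2nd st1w */
        i += 2 * VL;                          /* inch */
        p0 = first_active(i, size);           /* whilelo p0 */
        c.whilelos++;
        c.branches++;                         /* b.mi */
    }
    return c;
}

/* Assert that both shapes do the same control work for one size. */
void check_counts(int size) {
    Counts r = count_rolled(size), u = count_unrolled2(size);
    assert(r.stores == u.stores);
    assert(r.whilelos == u.whilelos && u.whilelos == u.stores + 1);
    assert(r.branches == u.branches && u.branches == u.stores);
}
```

Under these assumptions both shapes execute `ceil(size / VL)` stores, one more `whilelo` than stores, and one conditional branch per store for every `size`, while the unrolled listing also adds a `mov` per iteration. The sketch counts instructions only and does not model timing.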