FreeRTOS queue much slower on RISC-V when compiled with LLVM rather than GCC

We have compiled FreeRTOS on RISC-V with both LLVM and gcc and noticed that FreeRTOS memory management (especially the FreeRTOS queue) is much slower with LLVM. I wrote a script to compare the generated assembly for the parts where clang was slower, and it turned out that instead of the “ld” (load doubleword) that gcc generates, clang generates four “lb” (load byte) instructions with extra move and branch instructions. At first glance it seems that gcc can somehow reason about data alignment where LLVM cannot. I was wondering if anyone could provide some insight or a potential solution.

Thank you in advance

Please provide more details like C code, compiler commands, output assemblies, etc.


At first glance, it seems to be related to unaligned memory access.

Many thanks for your comment.

The C code of the main bottleneck for our benchmarks is here (FreeRTOS kernel qsend):

The gcc-compiled kernel executes gcc-compiled code with the following compile/link flags:

C_FLAGS: -march=rv64imafdc -mcmodel=medany -fno-builtin-printf -fno-plt -fno-pic -fno-exceptions -fno-stack-protector -U_FORTIFY_SOURCE -O3 -DNDEBUG -ffunction-sections -fdata-sections -std=gnu99

L_FLAGS: -march=rv64imafdc -mcmodel=medany -fno-builtin-printf -fno-plt -fno-pic -fno-exceptions -fno-stack-protector -U_FORTIFY_SOURCE -O3 -DNDEBUG -U_FORTIFY_SOURCE -nostartfiles -static -nodefaultlibs -Wl,--gc-sections

The clang-compiled kernel executes clang-compiled code with the following flags:
C_FLAGS = -march=rv64imafdc -mcmodel=medany -fno-builtin-printf -fno-plt -fno-pic -fno-exceptions -fno-stack-protector -U_FORTIFY_SOURCE -Xclang -target-feature -Xclang +experimental-p -Xclang -target-feature -Xclang +ls --target=riscv64-unknown-linux-gnu -O3 -DNDEBUG -ffunction-sections -fdata-sections -std=gnu99

L_FLAGS: -march=rv64imafdc -mcmodel=medany -fno-builtin-printf -fno-plt -fno-pic -fno-exceptions -fno-stack-protector -U_FORTIFY_SOURCE -nostartfiles -static -nodefaultlibs -Xclang -target-feature -Xclang +experimental-p -Xclang -target-feature -Xclang +ls -O3 -DNDEBUG --target=riscv64-unknown-linux-gnu -Wl,--gc-sections

The data being sent to (or received from) the queue is defined with an alignment attribute like:
typedef struct __attribute__((aligned(16))) MessageIn { … }
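
For completeness, the full declaration has this shape (the fields below are just placeholders, the real struct members are omitted here):

#include <stdint.h>

/* Placeholder fields; the real struct members are omitted.
 * The __attribute__((aligned(16))) is the relevant part. */
typedef struct __attribute__((aligned(16))) MessageIn {
    uint32_t id;
    uint8_t  payload[60];
} MessageIn;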

I hope this info is helpful; please let me know if anything is unclear.

Can you paste the assembly diff between gcc and clang, and the version of clang you are using (clang -v)?

We are using clang 15 and gcc 10.3. The assembly diff is huge, so unfortunately I cannot paste it here.

Here are the total assembly instruction counts of the clang-compiled binary:
addi : 2472, sb : 1220, mv : 1201, bltu : 1185, lb : 1168, addiw : 189, ld : 149
beq : 127, sd : 124, slli : 121, lbu : 111, lui : 107, andi : 70, li : 67, jal : 65
bne : 44, j : 20, auipc : 18, add : 18, bge : 15, blt : 14, srli : 10, lw : 7
remu : 6, sw : 6, addw : 6, sltu : 4, or : 4, divu : 4, bgeu : 4, sltiu : 3
csrrs : 3, mul : 2, csrrw : 2, slt : 2, and : 2, csrrci : 1, sll : 1, csrrsi : 1
mret : 1 , mulhu : 1

Assembly instruction count difference between the two files (positive means more instructions in the clang-compiled binary):
addi : 1909 , bltu : 1176, sb : 1168 , lb : 1168, mv : 1157 , ld : -187 , bne : -161
sd : -150 , addiw : 116, lw : -57, slli : 56 , sw : -54 , jal : -52, lui : 51, auipc : -47
andi : -44, li : -30, beq : 16, blt : 10, bgeu : -10, srli : 9, bge : 8, j : 7, add : -7
addw : -6, sltu : 4, csrrs : 3, csrrw : 2, slt : 2, or : 1, divu : -1, mul : 1, and : 1
mret : 1, mulhu : 1, lbu : 0, remu : 0, sltiu : 0, csrrci : 0, sll : 0, csrrsi : 0

A diff fragment would help.

It seems that a lot of lw/sw and ld/sd instructions are replaced by lb/sb. If your target CPU supports unaligned memory access, please specify -Xclang -target-feature -Xclang +unaligned-scalar-mem to see if there is an improvement.
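
For example, on top of the C_FLAGS above (a sketch; only worth trying if the core actually tolerates unaligned scalar accesses, since on a core that traps this would make things worse, not better):

clang --target=riscv64-unknown-linux-gnu -march=rv64imafdc -O3 \
  -Xclang -target-feature -Xclang +unaligned-scalar-mem \
  -c queue.c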

Our RISC-V CPU does not support unaligned memory access.

Here is part of the assembly code generated by clang:

addiw (c.addiw) s7, s7, 1
addi (c.addi) s0, s0, 1
lbu a0, 4095(s0)
bne (c.bnez) a0, 46430
j (c.j) 466b0
mv (c.mv) a0, s7
ld (c.ldsp) ra, 376(sp)
ld (c.ldsp) s0, 368(sp)
ld (c.ldsp) s1, 360(sp)
ld (c.ldsp) s2, 352(sp)
ld (c.ldsp) s3, 344(sp)
ld (c.ldsp) s4, 336(sp)
ld (c.ldsp) s5, 328(sp)
ld (c.ldsp) s6, 320(sp)
ld (c.ldsp) s7, 312(sp)
ld (c.ldsp) s8, 304(sp)
ld (c.ldsp) s9, 296(sp)
ld (c.ldsp) s10, 288(sp)
ld (c.ldsp) s11, 280(sp)
addi (c.addi16sp) sp, sp, 384
ret
bne (c.bnez) s0, 418d2
ld (c.ldsp) ra, 56(sp)
ld (c.ldsp) s0, 48(sp)
ld (c.ldsp) s1, 40(sp)
ld (c.ldsp) s2, 32(sp)
ld (c.ldsp) s3, 24(sp)
ld (c.ldsp) s4, 16(sp)
addi (c.addi16sp) sp, sp, 112
ret
ld a2, 0(s3)
addi (c.addi4spn) a1, sp, 48
mv (c.mv) a0, s10
li (c.li) a3, 0
jal ra, 41bb8
addi (c.addi16sp) sp, sp, -96
sd (c.sdsp) ra, 88(sp)
sd (c.sdsp) s0, 80(sp)
sd (c.sdsp) s1, 72(sp)
sd (c.sdsp) s2, 64(sp)
sd (c.sdsp) s3, 56(sp)
sd (c.sdsp) s4, 48(sp)
sd (c.sdsp) s5, 40(sp)
sd (c.sdsp) s6, 32(sp)
mv (c.mv) s4, a3
mv (c.mv) s1, a2
mv (c.mv) s2, a1
mv (c.mv) s0, a0
sd (c.sdsp) a2, 24(sp)
beq (c.beqz) a0, 41bdc
bne s2, zero, 22
li (c.li) a0, 2
bne s4, a0, 16
jal ra, 44c06
auipc a0, 0xfb
addi a0, a0, 1130
ld (c.ld) a0, 0(a0)
beq (c.beqz) a0, 44c24
auipc a0, 0xfb
addi a0, a0, 1126
ld (c.ld) a0, 0(a0)
sltiu a0, a0, 0x1
slli (c.slli) a0, a0, 1
ret
sltu a0, zero, a0
sltiu a1, s1, 0x1
or (c.or) a0, a0, a1
bne (c.bnez) a0, 41c12
csrrci zero, 08, csr 768
addi s1, gp, -2024
ld (c.ld) a1, 0(s1)
addi a0, a1, 1
sd (c.sd) a0, 0(s1)
ld (c.ld) a2, 112(s0)
ld (c.ld) a3, 120(s0)
addi a4, s4, -2
sltiu s5, a4, 0x1
sltu a2, a2, a3
or a2, s5, a2
beq (c.beqz) a2, 41c74
ld s3, 112(s0)
ld (c.ld) a2, 128(s0)
beq a2, zero, 264
beq s4, zero, 276
ld (c.ld) a0, 8(s0)
mv (c.mv) a1, s2
jal ra, 46f1a
or a3, a0, a1
andi a4, a3, 0x7
beq (c.beqz) a4, 46f46
andi a6, a2, 0x7
beq a6, zero, 36
c.andi a2, zero, -48
bge zero, a2, 108
add a4, a0, a6
add (c.add) a2, a2, a4
add (c.add) a1, a1, a6
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16
lb a3, 0(a1)
addi (c.addi) a1, a1, 1

And here is the assembly generated by gcc:

ld (c.ldsp) ra, 8(sp)
li (c.li) a0, 1
addi (c.addi) sp, sp, 16
ret
lw (c.lw) a5, 0(s0)
addi (c.addi) s1, s1, 1
addiw (c.addiw) a5, a5, 1
sw (c.sw) a5, 0(s0)
lbu a0, 0(s1)
bne a0, s3, -18
beq (c.beqz) a0, 40866
ld (c.ldsp) ra, 344(sp)
ld (c.ldsp) s0, 336(sp)
ld (c.ldsp) s1, 328(sp)
ld (c.ldsp) s2, 320(sp)
ld (c.ldsp) s3, 312(sp)
ld (c.ldsp) s4, 304(sp)
ld (c.ldsp) s5, 296(sp)
ld (c.ldsp) s6, 288(sp)
ld (c.ldsp) s7, 280(sp)
ld (c.ldsp) s8, 272(sp)
ld (c.ldsp) s9, 264(sp)
ld (c.ldsp) s10, 256(sp)
addi (c.addi16sp) sp, sp, 352
ret
ld (c.ldsp) ra, 24(sp)
c.lwsp a0, 12
addi (c.addi16sp) sp, sp, 32
ret
bne (c.bnez) s0, 444f2
ld (c.ldsp) ra, 56(sp)
ld (c.ldsp) s0, 48(sp)
ld (c.ldsp) s1, 40(sp)
ld (c.ldsp) s2, 32(sp)
ld (c.ldsp) s3, 24(sp)
ld (c.ldsp) s4, 16(sp)
addi (c.addi16sp) sp, sp, 112
ret
ld a2, 0(s3)
li (c.li) a3, 0
mv (c.mv) a1, sp
mv (c.mv) a0, s9
jal ra, 467c2
addi (c.addi16sp) sp, sp, -96
sd (c.sdsp) s0, 80(sp)
sd (c.sdsp) s3, 56(sp)
sd (c.sdsp) s5, 40(sp)
sd (c.sdsp) ra, 88(sp)
sd (c.sdsp) s1, 72(sp)
sd (c.sdsp) s2, 64(sp)
sd (c.sdsp) s4, 48(sp)
sd (c.sdsp) a2, 8(sp)
mv (c.mv) s0, a0
mv (c.mv) s5, a1
mv (c.mv) s3, a3
beq a0, zero, 394
beq s5, zero, 318
li (c.li) a5, 2
beq s3, a5, 298
jal ra, 43a42
auipc a5, 0xfc
ld a5, 1630(a5)
li (c.li) a0, 1
beq (c.beqz) a5, 43a5c
auipc a0, 0xfc
ld a0, 1562(a0)
sltiu a0, a0, 0x1
slli (c.slli) a0, a0, 1
ret
bne (c.bnez) a0, 467f4
csrrci zero, 08, csr 768
auipc s1, 0xfa
addi s1, s1, -2032
ld (c.ld) a4, 0(s1)
ld (c.ld) a2, 112(s0)
ld (c.ld) a3, 120(s0)
addi a5, a4, 1
sd (c.sd) a5, 0(s1)
bltu a2, a3, 152
ld (c.ld) a2, 128(s0)
ld s2, 112(s0)
bne (c.bnez) a2, 46934
bne s3, zero, 60
ld (c.ld) a0, 8(s0)
mv (c.mv) a1, s5
addi (c.addi) s2, s2, 1
jal ra, 40d74
or a5, a1, a0
andi a4, a5, 0x7
add a6, a0, a2
beq (c.beqz) a4, 40da2
andi t1, a2, 0x7
slli a7, t1, 3
add (c.add) a7, a7, a0
mv (c.mv) a5, a0
mv (c.mv) a4, a1
bgeu a0, a7, 18
add a5, a0, t1
add a7, a1, t1
bgeu a5, a6, 476
addi a4, t1, 1
add (c.add) a4, a4, a1
sub t1, a2, t1
sub a4, a5, a4
addi a3, t1, -1
sltiu a4, a4, 0x7
sltiu a3, a3, 0x8
xori a3, a3, 0x1
xori a4, a4, 0x1
and (c.and) a4, a4, a3
andi a4, a4, 0xff
mv (c.mv) a3, a5
beq (c.beqz) a4, 40e8e
or a4, a5, a7
c.andi a4, zero, 54
bne (c.bnez) a4, 40e8e
andi a1, t1, 0x-8
mv (c.mv) a4, a7
add (c.add) a1, a1, a7
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10
ld (c.ld) a2, 0(a4)

OK, I get it.
Please see GCC and LLVM.
The key point here is that GCC “vectorizes” the loop while LLVM doesn’t.
IIRC, this is a known issue (not RISC-V specific).

Huh? I don’t know what conclusion you’re trying to draw here, but I think you’re wildly off base.

Neither of the links above include “v” in the mattr string. As such, neither compiler is vectorizing this example. If you add v to the arch string, LLVM does vectorize the result, whereas gcc does not. This seems like the direct opposite of what you claimed, so I’m very confused by your intent.

Why are you removing all the horizontal whitespace AND the labels that make ASM so much easier to read and contemplate?

My bad, I was about to go to sleep, so I left this unclear comment. Let me explain it in more detail.

#define N 65535

typedef struct A{
    char data[N];
} __attribute__((aligned(16))) A;

void foo(A* a, A* b) {
    for (int i = 0; i < N; i++) {
      a->data[i] = b->data[i];
    }
}

Compiling this C code using riscv64-unknown-linux-gnu-gcc -march=rv64imafdc -O3 reduced.c -S -fno-builtin, we get the following assembly (I left some comments):

foo:
	li	a2,65536
	addi	a2,a2,-8    # a2 = 65528, the copy size rounded down to a multiple of 8.
	mv	a5,a1
	mv	a4,a0
	add	a2,a1,a2
.L2:            # "Vectorized" loop, using an 8x8-bit vector (a whole 64-bit scalar register, actually).
	ld	a3,0(a5)
	addi	a5,a5,8
	addi	a4,a4,8
	sd	a3,-8(a4)
	bne	a5,a2,.L2

	li	a5,65536 # Epilogue of the "vectorized" loop: copy the remaining 65535 % 8 = 7 bytes.
	add	a1,a1,a5
	lw	a4,-8(a1)
	add	a5,a0,a5
	sw	a4,-8(a5)
	lhu	a4,-4(a1)
	sh	a4,-4(a5)
	lbu	a4,-2(a1)
	sb	a4,-2(a5)
	ret

As we can see, the compiled code has the same structure as a vectorized loop would have if V existed: a main loop that copies 8 bytes per iteration through a 64-bit scalar register, then an epilogue that handles the remaining 65535 % 8 = 7 bytes with a 4-byte lw/sw, a 2-byte lhu/sh, and a 1-byte lbu/sb. This optimization is enabled only at -O3.
Let’s dig into the GCC GIMPLE.
Before vect:

void foo (struct A * a, struct A * b) {
  int i;
  char _1;
  unsigned int ivtmp_3;
  unsigned int ivtmp_13;

  <bb 2> [local count: 10737416]:

  <bb 3> [local count: 1063004409]:
  # i_11 = PHI <i_8(5), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(5), 65535(2)>
  _1 = b_5(D)->data[i_11];
  a_6(D)->data[i_11] = _1;
  i_8 = i_11 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  if (ivtmp_3 != 0)
    goto <bb 5>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 5> [local count: 1052374367]:
  goto <bb 3>; [100.00%]

  <bb 4> [local count: 10737416]:
  return;
}

After vect:

void foo (struct A * a, struct A * b) {
  char * vectp_a.10;
  vector(8) char * vectp_a.9;
  vector(8) char vect__1.8;
  char * vectp_b.7;
  vector(8) char * vectp_b.6;
  unsigned int tmp.5;
  int tmp.4;
  int i;
  char _1;
  unsigned int ivtmp_3;
  unsigned int ivtmp_13;
  unsigned int ivtmp_14;
  char _15;
  unsigned int ivtmp_18;
  unsigned int ivtmp_26;
  unsigned int ivtmp_27;

  <bb 2> [local count: 10737416]:

  <bb 3> [local count: 139586405]:
  # i_11 = PHI <i_8(5), 0(2)>
  # ivtmp_13 = PHI <ivtmp_3(5), 65535(2)>
  # vectp_b.6_20 = PHI <vectp_b.6_21(5), b_5(D)(2)>
  # vectp_a.9_23 = PHI <vectp_a.9_24(5), a_6(D)(2)>
  # ivtmp_26 = PHI <ivtmp_27(5), 0(2)>
  vect__1.8_22 = MEM <vector(8) char> [(char *)vectp_b.6_20];
  _1 = b_5(D)->data[i_11];
  MEM <vector(8) char> [(char *)vectp_a.9_23] = vect__1.8_22;
  i_8 = i_11 + 1;
  ivtmp_3 = ivtmp_13 - 1;
  vectp_b.6_21 = vectp_b.6_20 + 8;
  vectp_a.9_24 = vectp_a.9_23 + 8;
  ivtmp_27 = ivtmp_26 + 1;
  if (ivtmp_27 < 8191)
    goto <bb 5>; [92.31%]
  else
    goto <bb 7>; [7.69%]

  <bb 5> [local count: 128848989]:
  goto <bb 3>; [100.00%]

  <bb 7> [local count: 10737416]:

  <bb 8> [local count: 1063004409]:
  # i_10 = PHI <i_17(9), 65528(7)>
  # ivtmp_14 = PHI <ivtmp_18(9), 7(7)>
  _15 = b_5(D)->data[i_10];
  a_6(D)->data[i_10] = _15;
  i_17 = i_10 + 1;
  ivtmp_18 = ivtmp_14 - 1;
  if (ivtmp_18 != 0)
    goto <bb 9>; [99.00%]
  else
    goto <bb 4>; [1.00%]

  <bb 9> [local count: 1052374367]:
  goto <bb 8>; [100.00%]

  <bb 4> [local count: 10737416]:
  return;
}
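
In C terms, the transformed code is equivalent to something like this (a hand-written rendition of the GIMPLE above, not compiler output; foo_vect is a made-up name):

#include <stdint.h>

/* Main loop: 8191 iterations, each moving one vector(8) char,
 * i.e. one 64-bit chunk (8191 * 8 = 65528 bytes).
 * Epilogue: the remaining 7 bytes, copied scalar. */
void foo_vect(A* a, A* b) {
    uint64_t *d = (uint64_t *)a->data;
    uint64_t *s = (uint64_t *)b->data;
    for (int j = 0; j < 8191; j++)
        d[j] = s[j];
    for (int i = 65528; i < 65535; i++)
        a->data[i] = b->data[i];
}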

We are using vector(8) here, and these vectors are then expanded to DI (64-bit integer) mode:

(code_label 17 11 12 4 2 (nil) [1 uses])
(note 12 17 13 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 13 12 14 4 (set (reg:DI 74 [ vect__1.8 ])
        (mem:DI (reg:DI 77 [ ivtmp.25 ]) [0 MEM <vector(8) char> [(char *)_53]+0 S8 A64])) "../reduced.c":9:27 -1
     (nil))
(insn 14 13 15 4 (set (mem:DI (reg:DI 75 [ ivtmp.28 ]) [0 MEM <vector(8) char> [(char *)_54]+0 S8 A64])
        (reg:DI 74 [ vect__1.8 ])) "../reduced.c":9:18 -1
     (nil))
(insn 15 14 16 4 (set (reg:DI 77 [ ivtmp.25 ])
        (plus:DI (reg:DI 77 [ ivtmp.25 ])
            (const_int 8 [0x8]))) -1
     (nil))
(insn 16 15 18 4 (set (reg:DI 75 [ ivtmp.28 ])
        (plus:DI (reg:DI 75 [ ivtmp.28 ])
            (const_int 8 [0x8]))) -1
     (nil))
(jump_insn 18 16 19 4 (set (pc)
        (if_then_else (ne (reg:DI 77 [ ivtmp.25 ])
                (reg:DI 80 [ _61 ]))
            (label_ref 17)
            (pc))) -1
     (int_list:REG_BR_PROB 991146302 (nil))
 -> 17)

I don’t know much about GCC’s vectorizer, so I don’t know which part (unroll and SLP? or simply loop vectorizer?) does this.


For LLVM, we can’t vectorize this loop, since <8 x i8> is not a legal vector type when V is not specified:

SLP: Didn't find any vector registers for target, abort.

LoopVectorizePass::runImpl

  // Don't attempt if
  // 1. the target claims to have no vector registers, and
  // 2. interleaving won't help ILP.
  //
  // The second condition is necessary because, even if the target has no
  // vector registers, loop vectorization may still enable scalar
  // interleaving.
  if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true)) &&
      TTI->getMaxInterleaveFactor(ElementCount::getFixed(1)) < 2)
    return LoopVectorizeResult(false, false);
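
As a side note, and consistent with the earlier point about the mattr string: once V is available, LLVM does vectorize this loop. For example (assuming a clang build with V support; the exact output will vary by version):

clang --target=riscv64-unknown-linux-gnu -march=rv64imafdcv -O3 -fno-builtin -S reduced.c

should emit a vsetvli-driven loop with vector byte loads/stores (vle8.v/vse8.v) instead of the scalar byte copy.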

Based on that example, gcc recognizes the pattern as memmove; llvm does not. Adding restrict to the pointers gets llvm to recognize it as memcpy.

I think gcc’s alias analysis sees that the two struct pointers must either be identical or completely non-overlapping.

LLVM/clang’s alias analysis does not get any information about the struct or the array, so we conservatively assume there could be some overlap.
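
Concretely, on the reduced example above (just a sketch; note @hsadeghi reports below that restrict did not help in the real FreeRTOS code, presumably because the queue copy goes through void* parameters that the callers’ restrict qualifiers can’t reach):

/* restrict promises the compiler that a and b never overlap,
 * which makes a memcpy-style (wide, forward) copy legal. */
void foo(A* restrict a, A* restrict b) {
    for (int i = 0; i < N; i++) {
        a->data[i] = b->data[i];
    }
}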

That’s probably not the key point. My example may be misleading, so I added -fno-builtin to disable this.
Also, we don’t see any calls to memcpy in @hsadeghi’s assemblies; I think that’s because the assemblies were disassembled from the object file.
The main difference between LLVM and GCC is:
LLVM

lb a3, 0(a1)
addi (c.addi) a1, a1, 1
addi a5, a4, 1
sb a3, 0(a4)
mv (c.mv) a4, a5
bltu a5, a2, -16

GCC

ld (c.ld) a2, 0(a4)
addi (c.addi) a4, a4, 8
addi (c.addi) a3, a3, 8
sd a2, -8(a3)
bne a4, a1, -10

The loops copy at different widths: LLVM’s loop moves one byte per iteration in 6 instructions, while GCC’s moves eight bytes per iteration in 5 instructions, roughly a 10x difference in bytes moved per instruction.

Sorry for the inconvenience; it is because the assemblies are disassembled from the object file.

Thanks for your comment.
I had already tried “restrict”, which had no effect.

Thanks for the explanation and the code example; I think this makes sense. So from your comments I would conclude that I need to implement an opt pass that identifies memset and memmove patterns and optimizes them by using a wider data type, like GCC does.
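
At the C level, the transformation I have in mind would look roughly like this (just a sketch of the idea, not an actual LLVM pass; copy_aligned_wide is a made-up name, and the uint64_t accesses assume the buffers really are 8-byte aligned and non-overlapping):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: copy between two non-overlapping,
 * 8-byte-aligned buffers one doubleword (ld/sd) at a time,
 * mirroring the loop gcc emits, then finish the tail bytewise.
 * A production version would also need to stay strict-aliasing
 * clean, e.g. by doing memcpy on each 8-byte chunk. */
static void copy_aligned_wide(void *dst, const void *src, size_t n) {
    uint64_t *d = dst;
    const uint64_t *s = src;
    size_t i = 0;
    for (; i + 8 <= n; i += 8)   /* main loop: one ld + one sd per 8 bytes */
        *d++ = *s++;
    for (; i < n; i++)           /* tail: at most 7 bytes, lb/sb */
        ((char *)dst)[i] = ((const char *)src)[i];
}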