I want to compile C code for a RISC-V chip that is similar to the RocketChip.
As a test, I'm compiling a simple multiplication of two global variables with this command:
clang mul.c -O1 -o mult -fuse-ld=lld -mno-relax --gcc-toolchain=~/riscv/_install/riscv64-unknown-elf --target=riscv64 -march=rv64imc -mcpu=rocket-rv64 -static -nostdlib -nostartfiles
The RocketModel scheduling model, which is part of the upstream LLVM GitHub repo,
gives multiplication an instruction latency of 4 cycles. The objdump output:
0000000000011158 :
11158: 00012537 lui a0,0x12
1115c: 17852503 lw a0,376(a0) # 12178
11160: 000125b7 lui a1,0x12
11164: 17c5a583 lw a1,380(a1) # 1217c
11168: 02a5853b mulw a0,a1,a0
1116c: 000125b7 lui a1,0x12
11170: 18a5a023 sw a0,384(a1) # 12180
11174: 8082 ret
Since the instruction latency of mulw at 11168 is 4 cycles, I expected 2 NOPs (or NOP-like instructions) between “1116c lui” and “11170 sw”.
Why isn’t that the case?
Latency is for performance, not correctness; that is, it describes non-exposed hazards. It tells the scheduler to try to place the mulw at least 4 cycles' worth of instructions before any use of its result. There is no point in adding NOPs, though: the hardware will stall its pipeline as needed without them, so they would just add code bloat and would degrade performance when run on a processor that has a lower multiplication latency.
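To make the stall behaviour concrete, here is a minimal C sketch of a single-issue, in-order pipeline with interlocks, applied to the listing from the question. It is a toy model, not the real Rocket pipeline: only the 4-cycle mulw latency comes from the question. The 1-cycle latency for every other instruction, the struct insn type, and the names simulate and listing are assumptions made for illustration.

```c
#define NUM_REGS 32

/* One instruction: destination and source registers (-1 = unused)
   and the number of cycles until its result can be consumed. */
struct insn {
    const char *text;
    int dst, src1, src2;
    int latency;
};

/* RISC-V ABI register numbers: ra = x1, a0 = x10, a1 = x11. */
enum { RA = 1, A0 = 10, A1 = 11 };

/* The listing from the question. Only the mulw latency (4) is taken
   from the question; every other latency is an ASSUMED 1 cycle. */
const struct insn listing[] = {
    { "lui  a0,0x12",   A0, -1, -1, 1 },
    { "lw   a0,376(a0)", A0, A0, -1, 1 },
    { "lui  a1,0x12",   A1, -1, -1, 1 },
    { "lw   a1,380(a1)", A1, A1, -1, 1 },
    { "mulw a0,a1,a0",  A0, A1, A0, 4 },
    { "lui  a1,0x12",   A1, -1, -1, 1 },
    { "sw   a0,384(a1)", -1, A0, A1, 1 },
    { "ret",            -1, RA, -1, 1 },
};

/* Simulate a single-issue, in-order pipeline with hardware interlocks.
   An instruction issues once all of its source registers are ready;
   until then the pipeline stalls. stalls[i] receives the stall cycles
   spent waiting before instruction i; the return value is the total. */
int simulate(const struct insn *prog, int n, int *stalls)
{
    int ready[NUM_REGS] = {0}; /* cycle at which each register is available */
    int cycle = 0;             /* earliest cycle the next instruction may issue */
    int total = 0;

    for (int i = 0; i < n; i++) {
        int issue = cycle;
        if (prog[i].src1 >= 0 && ready[prog[i].src1] > issue)
            issue = ready[prog[i].src1];
        if (prog[i].src2 >= 0 && ready[prog[i].src2] > issue)
            issue = ready[prog[i].src2];

        stalls[i] = issue - cycle; /* cycles the hardware waited */
        total += stalls[i];

        if (prog[i].dst >= 0)
            ready[prog[i].dst] = issue + prog[i].latency;
        cycle = issue + 1;         /* in-order: next one issues no earlier */
    }
    return total;
}
```

Calling simulate(listing, 8, stalls) under these assumptions reports 2 stall cycles in total, both charged to the sw: the lui after the mulw fills one of the latency cycles and the interlock covers the remaining two. Those are the same two cycles the question expected to see as NOPs, but they cost no code size, and they disappear if the latency field is lowered to model a faster multiplier.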