How to get RISCV instruction operand size

Hi there,

I am working on a RISCV disassembly project. I need to recover function parameter and return types by decoding an ELF file. The idea is simple: I walk through the instructions in a function and find registers that are used before they are defined, then treat those as function parameters (some edge cases are handled, of course).
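The "used before defined" walk can be sketched as follows. This is only a minimal illustration under stated assumptions: the Inst record, regBit, kArgRegs and liveInMask are made-up names, the use/def sets are transcribed by hand from the first eight instructions of the RISCV listing of foo further down in this post, and the walk is straight-line, so it ignores branches.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record: bit n set means integer register xn is read / written.
struct Inst {
    std::uint32_t uses;
    std::uint32_t defs;
};

constexpr std::uint32_t regBit(unsigned n) { return std::uint32_t{1} << n; }

// a0-a7 are x10-x17 in the standard RISC-V calling convention.
constexpr std::uint32_t kArgRegs = std::uint32_t{0xFF} << 10;

// Straight-line walk: a register is live-in if it is read before any write.
template <std::size_t N>
constexpr std::uint32_t liveInMask(const Inst (&body)[N]) {
    std::uint32_t defined = 0;
    std::uint32_t liveIn = 0;
    for (std::size_t i = 0; i < N; ++i) {
        liveIn |= body[i].uses & ~defined;  // reads happen before this inst's writes
        defined |= body[i].defs;
    }
    return liveIn;
}

// Hand-transcribed use/def sets (ra = x1, sp = x2, s0 = x8, a0 = x10, a1 = x11).
constexpr Inst kFooPrologue[] = {
    {regBit(2), regBit(2)},       // addi sp, sp, -32
    {regBit(1) | regBit(2), 0},   // sd   ra, 24(sp)
    {regBit(8) | regBit(2), 0},   // sd   s0, 16(sp)
    {regBit(2), regBit(8)},       // addi s0, sp, 32
    {regBit(10) | regBit(8), 0},  // sw   a0, -24(s0)
    {regBit(11) | regBit(8), 0},  // sd   a1, -32(s0)
    {regBit(8), regBit(11)},      // ld   a1, -32(s0)
    {0, regBit(10)},              // li   a0, 0
};

// Only a0 and a1 remain once the live-in set is restricted to argument registers.
static_assert((liveInMask(kFooPrologue) & kArgRegs) == (regBit(10) | regBit(11)),
              "a0 and a1 are the candidate parameters");
```

Note that ra, sp and s0 also show up as live-in because the prologue spills them, which is one of the edge cases; masking with the argument registers filters them out here.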

I use llvm-mc to disassemble the machine code and get the instructions, but I have no idea how to determine the operand size once I have them. RISCV seems to use the whole register to store the data. Consider the following example in C:

int foo(int a, long long b){
    if(b > 0L)
        return a;
    return 0;
}

The assembly code for foo on x86-64 would be:

0000000000001129 <foo>:
    1129: f3 0f 1e fa                  	endbr64
    112d: 55                           	pushq	%rbp
    112e: 48 89 e5                     	movq	%rsp, %rbp
    1131: 89 7d fc                     	movl	%edi, -4(%rbp)
    1134: 48 89 75 f0                  	movq	%rsi, -16(%rbp)
    1138: 48 83 7d f0 00               	cmpq	$0, -16(%rbp)
    113d: 7e 05                        	jle	0x1144 <foo+0x1b>
    113f: 8b 45 fc                     	movl	-4(%rbp), %eax
    1142: eb 05                        	jmp	0x1149 <foo+0x20>
    1144: b8 00 00 00 00               	movl	$0, %eax
    1149: 5d                           	popq	%rbp
    114a: c3                           	retq

Registers %edi and %rsi indicate that the parameters are 32-bit and 64-bit, respectively.

However, the corresponding RISCV assembly looks like this:

0000000080002000 <foo>:
80002000: 13 01 01 fe  	addi	sp, sp, -32
80002004: 23 3c 11 00  	sd	ra, 24(sp)
80002008: 23 38 81 00  	sd	s0, 16(sp)
8000200c: 13 04 01 02  	addi	s0, sp, 32
80002010: 23 24 a4 fe  	sw	a0, -24(s0)
80002014: 23 30 b4 fe  	sd	a1, -32(s0)
80002018: 83 35 04 fe  	ld	a1, -32(s0)
8000201c: 13 05 00 00  	li	a0, 0
80002020: 63 5a b5 00  	bge	a0, a1, 0x80002034 <foo+0x34>
80002024: 6f 00 40 00  	j	0x80002028 <foo+0x28>
80002028: 03 25 84 fe  	lw	a0, -24(s0)
8000202c: 23 26 a4 fe  	sw	a0, -20(s0)
80002030: 6f 00 00 01  	j	0x80002040 <foo+0x40>
80002034: 13 05 00 00  	li	a0, 0
80002038: 23 26 a4 fe  	sw	a0, -20(s0)
8000203c: 6f 00 40 00  	j	0x80002040 <foo+0x40>
80002040: 03 25 c4 fe  	lw	a0, -20(s0)
80002044: 03 34 01 01  	ld	s0, 16(sp)
80002048: 83 30 81 01  	ld	ra, 24(sp)
8000204c: 13 01 01 02  	addi	sp, sp, 32
80002050: 67 80 00 00  	ret

I cannot tell the size of a0 and a1 just by looking at the registers. It seems the only way to get the operand size in RISCV is to check the instruction (sw, sd, ld and li). But this approach requires handling EVERY instruction separately.
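For the integer loads and stores in the listing above, the per-instruction check does not need one case per mnemonic: in the 32-bit base encodings the access width sits in the funct3 field (bits 14:12). A minimal sketch, assuming only the LOAD (0x03) and STORE (0x23) opcodes matter; accessBytes and the constant names are made up for illustration, and compressed (16-bit) and floating-point encodings are not handled.

```cpp
#include <cstdint>

constexpr std::uint32_t kOpLoad = 0x03;   // lb, lh, lw, ld, lbu, lhu, lwu
constexpr std::uint32_t kOpStore = 0x23;  // sb, sh, sw, sd

// Returns the memory access width in bytes for a 32-bit integer load/store
// encoding, or 0 for anything else (including reserved funct3 values).
constexpr unsigned accessBytes(std::uint32_t insn) {
    const std::uint32_t opcode = insn & 0x7f;
    const std::uint32_t funct3 = (insn >> 12) & 0x7;
    const bool isLoad = opcode == kOpLoad && funct3 != 7;    // funct3 4-6 are the unsigned loads
    const bool isStore = opcode == kOpStore && funct3 <= 3;
    // The low two bits of funct3 hold log2 of the width in bytes.
    return (isLoad || isStore) ? (1u << (funct3 & 0x3)) : 0u;
}

// Encodings taken from the listing (bytes are little-endian, so
// "23 24 a4 fe" is the word 0xfea42423).
static_assert(accessBytes(0xfea42423) == 4, "sw a0, -24(s0)");
static_assert(accessBytes(0xfeb43023) == 8, "sd a1, -32(s0)");
static_assert(accessBytes(0xfe043583) == 8, "ld a1, -32(s0)");
static_assert(accessBytes(0xfe842503) == 4, "lw a0, -24(s0)");
static_assert(accessBytes(0x00000513) == 0, "li a0, 0 is not a memory access");
```

This only gives the width of the memory access; instructions such as li still have to be interpreted in context.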

I know that register size information is stored in the TargetRegisterInfo class, but the point is that RISCV does not divide a register into several sub-registers (like RAX, EAX, AX, AH, AL), at least not for x0-x31.

So my question is: how can I get the operand size in RISCV? (Remember, I am going from machine code to LLVM IR.)

You are mixing different concepts. A programming language talks in terms of typed values. Machine code operates on registers of a certain bit width. There is not necessarily a 1:1 relationship between the two.

You get the operand size of the instruction from MCInst/MCInstrDesc. But what you really want is a type that you can use in IR, and that requires you to interpret the instructions, even on x86:

Simple example:

__int128 add(__int128 a, __int128 b) {
  return a + b;
}

results in

        movq    %rdi, %rax
        addq    %rdx, %rax
        adcq    %rcx, %rsi
        movq    %rsi, %rdx
        retq

Regards,
Kai

You are not the first one:
https://blog.regehr.org/archives/2265

If the binary carries debug info, you could use llvm-dwarfdump to read the DWARF and get the function signatures back.

Thank you for your reply!

In fact, I am only dealing with int, float and pointer types, so the logic here is simple: if general-purpose register x10 is used as 32 bits, I treat it as an int; if f10 is used as 32 bits, I treat it as a float. This may not always hold, but the machine code I am disassembling is basic machine learning code, so long long does not occur, and a 64-bit use of x10 is treated as a pointer. To sum up, I handle the mapping from operand size to IR type manually.
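Written out, the manual mapping looks roughly like this. It is a sketch of the rules just described, not a general solution: RegFile, IRType and guessType are made-up names, and any combination not mentioned above (for example a 64-bit use of an f register) is reported as Unknown rather than guessed.

```cpp
// Which register file the operand lives in (x0-x31 vs. f0-f31).
enum class RegFile { Integer, Float };

// The only IR types this project distinguishes, plus a fallback.
enum class IRType { I32, Ptr, Float, Unknown };

// Maps (register file, width in bytes) to an IR type using the
// project-specific rules: no long long, so 64-bit integer means pointer.
constexpr IRType guessType(RegFile file, unsigned bytes) {
    if (file == RegFile::Integer) {
        return bytes == 4 ? IRType::I32
             : bytes == 8 ? IRType::Ptr
                          : IRType::Unknown;
    }
    return bytes == 4 ? IRType::Float : IRType::Unknown;
}

static_assert(guessType(RegFile::Integer, 4) == IRType::I32, "x10 used as 32 bits");
static_assert(guessType(RegFile::Integer, 8) == IRType::Ptr, "x10 used as 64 bits");
static_assert(guessType(RegFile::Float, 4) == IRType::Float, "f10 used as 32 bits");
```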

In this case, MCInst/MCInstrDesc are more relevant to my goal. I checked these classes, but only found the size of the instruction encoding, not the size of its operands. Please correct me if I missed anything.

class MCInstrDesc {
public:
...
  unsigned char Size;            // Number of bytes in encoding.
...
}