Use llvm-mc to obtain bytes of assembler?


I have some LLVM IR function definitions, and I would like to obtain the literal machine code bytes for the body of each function. I know the LLVM IR statements may not correspond one-to-one to the machine code, and that’s OK. I just want a function that is inline, so no arguments and no return type, and then I want to map that function to its machine code bytes.

One way to do this would be to compile the object and then disassemble it. I would rather avoid that, as it feels dirty. So far, the disassembler CLIs I have tried don’t let me get what I need without reaching into them to use an internal API. I guess if I can’t do it with LLVM MC, I will have to use Capstone or something else.

In my LLVM API usage, I am able to compile down to object code. So I have a working pipeline, but I would prefer to extend it so that I can reach into the object programmatically. Originally, I wanted to compile straight into a buffer and then read the bytes back out. As far as I can tell, LLVM doesn’t make this accessible, as there are no plugins that run at machine code emission time.

I was wondering, though, whether there is a way to pass assembly to LLVM MC and obtain the bytes that the assembly would produce. Then I could compile to assembly with my pipeline instead and pass that to llvm-mc.

Is there a way to do this with LLVM MC?

Maybe llvm-mc --show-encoding? It will give you something like “encoding: [0x06,0x72,0xff,0xff,0x08,0x07]”.
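
A minimal Python sketch of that suggestion, assuming llvm-mc is on PATH and that every instruction line in its output carries an “encoding: [...]” comment like the one quoted above. The function names and the default triple are illustrative choices, not part of LLVM.

```python
import re
import subprocess

# Matches the bracketed list inside an "encoding: [0x06,0x72,...]" comment.
ENCODING_RE = re.compile(r"encoding:\s*\[([^\]]*)\]")
# A fully resolved byte such as 0x06.
BYTE_RE = re.compile(r"0x[0-9a-fA-F]{2}")


def parse_encodings(mc_output: str) -> bytes:
    """Concatenate the bytes of every 'encoding: [...]' comment, in order."""
    out = bytearray()
    for match in ENCODING_RE.finditer(mc_output):
        for token in match.group(1).split(","):
            token = token.strip()
            if not token:
                continue
            # Only fully resolved bytes are handled in this sketch; anything
            # else (for example a relocation placeholder) is an error.
            if not BYTE_RE.fullmatch(token):
                raise ValueError(f"unresolved byte in encoding: {token!r}")
            out.append(int(token, 16))
    return bytes(out)


def assemble(asm_text: str, triple: str = "x86_64-unknown-linux-gnu") -> bytes:
    """Hypothetical wrapper: pipe assembly through llvm-mc and collect bytes."""
    result = subprocess.run(
        ["llvm-mc", "--show-encoding", f"--triple={triple}"],
        input=asm_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_encodings(result.stdout)
```

Calling assemble("retq\n") should then return the bytes for that single instruction, provided nothing in the input needs a relocation.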

Well, I am seeing some strange output in my results. Some of the comma-separated values look like…

[… , A, ‘A’, …

And so on. I don’t know what the random A is in there for. It should be nothing but a sequence of bytes.

I think that happens when you’re trying to assemble an instruction that needs a relocation (e.g. global variable access, or call to some function). The actual bytes can only be determined at link-time and llvm-mc represents this unknown with a placeholder ‘A’.
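
To make that concrete, here is a hedged Python sketch that parses the encoding list while keeping track of the placeholders. It assumes three placeholder shapes: a bare letter (A), a byte with a quoted letter attached (0x04'A'), and a binary form with letters in some bit positions (0b00AAAAAA). That is my reading of the llvm-mc output, so check it against your version. Placeholder bits are filled with 0, and the function name is illustrative.

```python
import re

ENCODING_RE = re.compile(r"encoding:\s*\[([^\]]*)\]")
# 0x06 (resolved) or 0x04'A' (some bits belong to fixup A).
BYTE_RE = re.compile(r"0x([0-9a-fA-F]{2})('[A-Z]')?")
# 0b00AAAAAA: a mix of known bits and fixup letters.
BITS_RE = re.compile(r"0b([01A-Z]{8})")


def parse_with_fixups(mc_output):
    """Return (bytes, offsets), where offsets lists bytes touched by a fixup."""
    data = bytearray()
    unresolved = []
    for match in ENCODING_RE.finditer(mc_output):
        for token in match.group(1).split(","):
            token = token.strip()
            if not token:
                continue
            byte = BYTE_RE.fullmatch(token)
            bits = BITS_RE.fullmatch(token)
            if byte:
                if byte.group(2):
                    unresolved.append(len(data))
                data.append(int(byte.group(1), 16))
            elif re.fullmatch(r"[A-Z]", token):
                # Whole byte is unknown until link time; use 0 as a stand-in.
                unresolved.append(len(data))
                data.append(0)
            elif bits:
                known = "".join(b if b in "01" else "0" for b in bits.group(1))
                if known != bits.group(1):
                    unresolved.append(len(data))
                data.append(int(known, 2))
            else:
                raise ValueError(f"unexpected token: {token!r}")
    return bytes(data), unresolved
```

The returned offsets tell you which bytes are only stand-ins, so you can decide whether the zero-filled result is good enough for your purpose.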

How can I get this to resolve all the way? Should I use the --disassemble mode for that? Or can I prevent the process from needing that resolution in the first place?

Running through an assemble/disassemble pipeline will probably eliminate those unknowns by replacing them with 0 in most cases (but not all).

It’s one possible encoding that might happen in the real world, though not a particularly likely one. For some purposes that’s good enough; for others it isn’t.
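
If the zero-filled bytes are good enough, one way to get them without disassembling anything is to assemble to an object file and copy out the raw code section. The sketch below assumes llvm-mc and llvm-objcopy are both on PATH and that the target is ELF, so the code lives in a section named .text. The helper names and the default triple are illustrative. Relocated fields keep whatever the assembler left in them, which is often but not always zero.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(asm_path, obj_path, bin_path,
                   triple="x86_64-unknown-linux-gnu"):
    """Return the two command lines: assemble, then extract raw .text."""
    assemble = [
        "llvm-mc", f"--triple={triple}", "--filetype=obj",
        "-o", str(obj_path), str(asm_path),
    ]
    extract = [
        "llvm-objcopy", "-O", "binary", "--only-section=.text",
        str(obj_path), str(bin_path),
    ]
    return assemble, extract


def text_section_bytes(asm_text, triple="x86_64-unknown-linux-gnu"):
    """Assemble asm_text and return the raw bytes of its .text section."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        asm, obj, raw = tmp / "in.s", tmp / "out.o", tmp / "text.bin"
        asm.write_text(asm_text)
        for cmd in build_commands(asm, obj, raw, triple):
            subprocess.run(cmd, check=True)
        return raw.read_bytes()
```

This keeps the whole function body as one byte string, at the cost of losing the per-instruction boundaries that --show-encoding gives you.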