Finding PC address of a given machine instruction

Is it at all possible to find out, from an LLVM backend, what the final address of an emitted instruction in an ELF executable is going to be? I need to be able to do this during the first pass of code emission, as I'm also relying on information passed down from earlier stages of compilation that is lost when simply emitting assembly or binary alone.

I imagine this might be possible if done at link time as part of LTO. Are there any resources on how to write a pass that executes at this point, and how to get the information I need?

Even with LTO, address assignment comes after code generation. LTO doesn't tell you any more about the binary layout at code generation time; it just gives you the full set of code present instead of a single translation unit.

The last passes that you can write are on MIR:
https://llvm.org/docs/MIRLangRef.html
It is still SSA. The layout of the binary file will happen later.

LTO happens on LLVM IR, which is even earlier.

LLVM MC will write the object file for you, but there are no passes.

It stops being SSA as part of the regalloc pipeline, so there are many non-SSA passes after that.

So do you think I might be able to get a final address somewhere like the instruction encoder of LLVM MC? Do you know if you still have access to machine instruction objects at that point too?

Hi, I’m also trying to do the same. Did you get a solution to this?

No.

Addresses in an executable cannot be known until the linker is done processing the relocatable object that the compiler produces. You can find out the address of an instruction relative to a function entry point, but not what it will become in the final executable. The linker determines that address.
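To put the same point in arithmetic form, here's a toy sketch (plain Python, all numbers made up) of which terms the compiler knows and which only get filled in later:

```python
# What a backend can know vs. what only the linker/loader decides.
inst_offset_in_func = 0x14        # codegen knows: offset from the function entry
func_offset_in_section = 0x230    # fixed once the object file is laid out
section_vaddr = 0x401000          # chosen by the linker
load_bias = 0x0                   # chosen by the loader (nonzero for PIE/.so)

final_addr = load_bias + section_vaddr + func_offset_in_section + inst_offset_in_func
print(hex(final_addr))  # 0x401244
```

Only the first term exists at code generation time; everything else is decided downstream.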

This question comes up often enough that I have to wonder: why do you want to know? What do you think you could do once you have that address? If you explained the goal, maybe someone could help you get to it without knowing something that a compiler backend inherently cannot know.

Hey, yeah, I looked into this more and the answer is ‘kinda’.

I’ve at least found the relevant code for emitting object files. Unfortunately, MCInst objects are encoded into raw bytes fairly early on in the codegen process, so they get really hard to track. Instructions are emitted as bytes into objects called data fragments, which carry a payload of raw bytes; this happens in MCObjectStreamer::emitInstToFragment. These fragments can then be merged with other fragments in MCELFStreamer::mergeFragment, and are attached to MCSection objects. These sections then have their final layout decided in MCAssembler::layout, called from MCAssembler::finish just before the actual object file is written to disk on the next line. There are probably some more steps I’m missing here that could change the order of those raw bytes, but those three classes should have them all somewhere (combined with whatever your specific backend wants to do too).

So, you do have the point where the instruction is initially encoded, and you could theoretically trace everywhere the data fragment could be manipulated and track the offset of your instruction of interest within the bytes buffer. That’s pretty intense though, and it only gets you as far as the object file! We haven’t even emitted an executable yet.
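As a toy model of that bookkeeping (plain Python, deliberately not the real LLVM API - just the shape of the problem): a fragment is a byte buffer, and every merge into a section shifts your tracked offset by whatever came before it.

```python
# Toy model of tracking an instruction's offset through fragment merging.
# Mirrors the idea behind MCObjectStreamer/MCELFStreamer, not the real classes.

class Fragment:
    def __init__(self):
        self.data = bytearray()

    def emit_inst(self, encoded: bytes) -> int:
        """Append an encoded instruction; return its offset in this fragment."""
        off = len(self.data)
        self.data += encoded
        return off

class Section:
    def __init__(self):
        self.data = bytearray()

    def merge(self, frag: Fragment, tracked_offset: int) -> int:
        """Merge a fragment and translate a fragment-relative offset into a
        section-relative one (the absolute address still needs the linker)."""
        base = len(self.data)
        self.data += frag.data
        return base + tracked_offset

frag = Fragment()
frag.emit_inst(b"\x1f\x20\x03\xd5")             # nop (AArch64 encoding)
load_off = frag.emit_inst(b"\x60\x02\x40\xf9")  # the load we care about

text = Section()
text.data += b"\x00" * 16                       # pretend other code came first
section_off = text.merge(frag, load_off)
print(section_off)  # 20
```

Every place that can reorder or pad the bytes needs this kind of offset translation, which is exactly why tracking it through the real pipeline is so painful.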

I know less about how to track things when linking the object files into an executable, but I imagine it might actually be more straightforward - I think reading the relocations solves a lot for you by default? But you’d have to look into that yourself.

The way I eventually solved my problem was by avoiding this mess and doing something very hacky, which I can get away with because I’m doing a PhD project. I’m only interested in tracking load instructions, so I’ve introduced alternate load instruction definitions into tablegen (like LDR_2) whose bit patterns are all 1s, but whose opcode objects still have the semantics of a load. You can write a Python script to parse the backend code and insert your new opcode under the same switch-statement cases as its original opcode, so you end up with stuff like:

case AArch64::LDRXroX:
case AArch64::LDRX2roX:
  return is_a_load;

This lets you have a garbage opcode that’s still understood in the same way as the original instruction by the backend you’re messing with.
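For what it’s worth, the case-duplicating script can be a pretty dumb line-based rewrite. A sketch (the opcode names and the ALTERNATES table are just examples; point it at your backend’s .cpp files and your own naming scheme):

```python
import re

# Map each original opcode to its hypothetical all-ones alternate.
ALTERNATES = {"LDRXroX": "LDRX2roX"}

def duplicate_cases(source: str) -> str:
    """After every 'case AArch64::<opc>:' line for an opcode we track,
    insert an identical case for its alternate opcode."""
    out = []
    for line in source.splitlines():
        out.append(line)
        m = re.match(r"(\s*)case AArch64::(\w+):\s*$", line)
        if m and m.group(2) in ALTERNATES:
            out.append(f"{m.group(1)}case AArch64::{ALTERNATES[m.group(2)]}:")
    return "\n".join(out)

snippet = "  case AArch64::LDRXroX:\n    return is_a_load;"
print(duplicate_cases(snippet))
```

This only handles bare `case X:` lines, but since switch cases in the backend overwhelmingly look like that, it covers nearly everything; the handful of odd spots you can patch by hand.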

Once you have this, you compile with your pass enabled to flip any instructions of interest to their alternate opcode variants. You then parse the resulting executable (I suggest the r2pipe Python package for this) searching for instructions made up of 32 ones, save the addresses, then recompile the program with your opcode flipping disabled (or, to be extra safe, you can change the bit patterns back to match their original opcodes). You now have a list of instruction addresses plus a regular compiled binary! I promise it’s easier than it sounds aha.
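The scanning step itself is trivial once you have the section bytes. Here’s a sketch with a made-up helper that just walks 4-byte words - in practice r2pipe or pyelftools would hand you the .text contents and the virtual address it’s linked at:

```python
import struct

def find_marker_instructions(code: bytes, base_addr: int) -> list:
    """Return the addresses of 32-bit words that are all ones (0xFFFFFFFF),
    given the section bytes and the virtual address they're linked at."""
    addrs = []
    for off in range(0, len(code) - 3, 4):  # AArch64: fixed 4-byte instructions
        (word,) = struct.unpack_from("<I", code, off)
        if word == 0xFFFFFFFF:
            addrs.append(base_addr + off)
    return addrs

# Toy .text contents: nop, marker, nop, marker
text = (b"\x1f\x20\x03\xd5" + b"\xff\xff\xff\xff"
        + b"\x1f\x20\x03\xd5" + b"\xff\xff\xff\xff")
print([hex(a) for a in find_marker_instructions(text, 0x400000)])
# ['0x400004', '0x40000c']
```

This works because AArch64 instructions are fixed-width and aligned; on a variable-length ISA you’d need a real disassembler instead of a word scan.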

If you don’t care about knowing the addresses in the compiler (which it sounded like OP did), you could also emit labels in front of the instructions you care about, and those label symbols can end up in the executable’s symbol table - as long as you don’t use local labels, which will get stripped.
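If you go the label route, recovering the addresses afterwards is just a matter of dumping the symbol table (nm, llvm-nm, readelf -s, ...) and parsing it. A sketch of the parsing side - the sample output and the `my_load_marker` symbol are made up, but follow nm’s usual `<addr> <type> <name>` format:

```python
def parse_nm(output: str) -> dict:
    """Map symbol name -> address from nm-style output lines; undefined
    symbols (no address column) are skipped."""
    syms = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            addr, _type, name = parts
            syms[name] = int(addr, 16)
    return syms

sample = """\
0000000000400620 T main
0000000000400654 t my_load_marker
                 U printf
"""
print(hex(parse_nm(sample)["my_load_marker"]))  # 0x400654
```

You’d feed it the real output via `subprocess.run(["nm", binary], capture_output=True, text=True).stdout`.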

But you still can’t know the addresses in the backend.

And if the code ends up in a shared object (.so), then even the linker doesn’t know the final absolute address. It’s the loader that loads the code into memory and does a bunch of fixups for the address constants.