Hey, yeah, I looked into this more and the answer is ‘kinda’.
I’ve at least found the relevant code for emitting object files. Unfortunately, MCInst objects are encoded into raw bytes fairly early in the code gen process, so they get really hard to track. Instructions are emitted as bytes into objects called data fragments, which carry a payload of raw bytes. This happens in MCObjectStreamer::emitInstToFragment. These fragments can then be merged with other fragments in MCELFStreamer::mergeFragment, and are attached to MCSection objects. These sections then have their final layout decided in MCAssembler::layout, which is called in MCAssembler::Finish just before the actual object file is written to disk on the next line. There are probably some more steps I’m missing here that could change the order of those raw bytes, but those three classes should have them all somewhere (combined with whatever your specific backend wants to do too).
So, you do have where the instruction is initially encoded, and you could theoretically trace everywhere that the data fragment could be manipulated and track the offset of your instruction of interest in the bytes buffer. That’s pretty intense though, and is just the object file! We haven’t even emitted an executable yet.
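To make the bookkeeping concrete, here is a toy Python model of that tracing. None of this is LLVM's API: the Fragment class, emit, merge and layout are hypothetical stand-ins I made up for data fragments, fragment merging and section layout, and the real code has more places where bytes can move (relaxation, alignment padding, backend-specific stuff).

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Toy stand-in for a data fragment: raw bytes plus tracked offsets."""
    data: bytearray = field(default_factory=bytearray)
    # label -> byte offset of the tracked instruction inside this fragment
    tracked: dict = field(default_factory=dict)

    def emit(self, encoded, label=None):
        # Record where the instruction starts before appending its bytes.
        if label is not None:
            self.tracked[label] = len(self.data)
        self.data += encoded

    def merge(self, other):
        # Bytes from `other` land after ours, so its offsets shift by our size.
        shift = len(self.data)
        for label, off in other.tracked.items():
            self.tracked[label] = shift + off
        self.data += other.data


def layout(fragments):
    """Place fragments back to back and return label -> section offset."""
    offsets = {}
    cursor = 0
    for frag in fragments:
        for label, off in frag.tracked.items():
            offsets[label] = cursor + off
        cursor += len(frag.data)
    return offsets
```

The point is just that every operation that moves bytes also has to update the offset you recorded, and you need to find every such operation for the tracking to stay correct.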
I know less about how to track things when linking the object files into an executable, but I imagine it might actually be more straightforward - I think reading the relocations solves a lot for you by default? But you’d have to look into that yourself.
The way I eventually solved my problem was by avoiding this mess and doing something very hacky, which I can get away with because I’m doing a PhD project. I’m only interested in tracking load instructions, so I’ve introduced alternate load instruction definitions into TableGen (like, LDR_2) where the bit pattern is all 1s, but the opcode object itself still has the semantics of a load. You can write a Python script to parse the backend code and just insert your new opcode under the same switch statement cases as its original opcode name, so you end up with stuff like:
case AArch64::LDRXroX:
case AArch64::LDRX2roX:
  return is_a_load;
This lets you have a garbage opcode that’s still understood in the same way as the original instruction by the backend you’re messing with.
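A minimal sketch of what that kind of script could look like, assuming each case label sits on its own line. The function name add_alternate_cases and the alternates mapping are made up for illustration, and this is not my actual script, just the general shape of it:

```python
import re

# Matches a line that holds only a case label, e.g. "  case NS::OPCODE:"
CASE_RE = re.compile(r"^(\s*)case\s+(\w+)::(\w+)\s*:\s*$")


def add_alternate_cases(source, alternates):
    """Insert `case NS::ALT:` after every `case NS::ORIG:` line.

    `alternates` maps original opcode names to alternate opcode names.
    Labels that are already present are not inserted a second time.
    """
    lines = source.splitlines()
    existing = {line.strip() for line in lines}
    out = []
    for line in lines:
        out.append(line)
        m = CASE_RE.match(line)
        if m and m.group(3) in alternates:
            indent, namespace, name = m.groups()
            label = f"case {namespace}::{alternates[name]}:"
            if label not in existing:
                out.append(indent + label)
    result = "\n".join(out)
    # splitlines() drops a trailing newline, so put it back if there was one.
    return result + "\n" if source.endswith("\n") else result
```

You would run it over each backend .cpp file and write the result back. It won't catch labels that share a line with other code, so it's worth grepping for the original opcode names afterwards to see what it missed.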
Once you have this, you compile with your pass enabled to flip any instructions of interest to their alternate opcode variants. You then parse the resulting executable (I suggest the r2pipe Python package for this), searching for instructions made up of 32 1s, and save the addresses. Then recompile the program with your opcode flipping disabled (or, to be extra safe, change the bit patterns to match their original opcodes again). You now have a list of instruction addresses + a regular compiled binary! I promise it’s easier than it sounds aha.
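For the search step, here is a rough stdlib-only sketch rather than r2pipe itself. It assumes you have already pulled out the raw bytes of the code section and know its load address (however you like to do that), and it relies on AArch64 instructions being fixed-width 4-byte words. The name find_marker_addresses is made up:

```python
MARKER = b"\xff" * 4  # an instruction word made up of 32 ones


def find_marker_addresses(text_bytes, base_addr, marker=MARKER):
    """Return the address of every aligned word in text_bytes equal to marker.

    text_bytes: raw contents of the code section
    base_addr:  address at which the first byte of text_bytes is loaded
    """
    step = len(marker)
    addresses = []
    # Only check instruction-aligned offsets; a partial trailing word is ignored.
    for off in range(0, len(text_bytes) - step + 1, step):
        if text_bytes[off:off + step] == marker:
            addresses.append(base_addr + off)
    return addresses
```

Anything else in the section that happens to be an all-ones word (literal data, for example) would show up as a false positive, so sanity check the count against how many instructions your pass flipped.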