[Machine IR] Analyzing Assembly Source Code in MIR passes

Dear LLVM developers,

My goal is to write LLVM Machine IR (MIR) passes that analyze assembly source code. However, it seems I first need a way to translate the handwritten assembly code into MIR format.

Are there any materials, or any modules in the LLVM source code, that can help translate assembly code into LLVM MIR for analysis?

Or is there an easier way to analyze assembly code in MIR passes without translating it?

Best Regards,
Lele Ma

MachineIR is designed for code generation, not for general assembly
representation. MIR is not even necessarily able to represent all
assembly instructions that a target's hardware supports. The
disassembler produces MCInsts, and if you wanted to go from there back
to MachineIR, you'd have to write your own target-specific code to do
so.

Cheers,
Nicolai

llvm-mctoll will raise a binary back to LLVM IR.
It is not exactly what you want, but it might be something you can leverage.

https://github.com/microsoft/llvm-mctoll

Thank you for the instructions, Aaron and Nicolai!

Raising a binary to LLVM IR, or raising it to MIR, is a reasonable solution for me. However, given Nicolai’s information that not all target-specific instructions are representable in MIR, I have two questions that I need your help with:

  1. Why does MIR not necessarily represent all target-specific instructions for a given piece of hardware? If someone added support for them, would this violate some design principle of MIR?

  2. Instead of raising to IR/MIR, I am wondering whether a third path is possible for analyzing assembly code:
    - write a simple LLVM pass in the MC layer to process information not available in MIR/IR, and
    - pass the analysis results from the IR/MIR passes to the MC layer pass, where we can enhance them with the missing representations.
    So the second question is whether it is possible to write passes directly in the MC layer. If so, is there any documentation or example for that?

Thank you in advance!

Best Regards,
Lele

Hi All,

This is a follow-up to, and a rephrasing of, my previous question, with an updated subject:

What I want to do is analyze hand-written assembly code in ‘full detail’, so that the semantics of each instruction are known to LLVM passes. Many such instructions have no counterparts in IR/MIR form, such as ‘syscall’, ‘iret’, etc. At the MC level, such assembly code can be translated to MCInst easily, since this level is closest to the assembly code. Therefore, I am thinking of writing a pass at the MC level instead of at the IR/MIR level.

However, while searching for how to write MC level passes, I could not find any related classes in the LLVM infrastructure (comparable to FunctionPass at the IR level or MachineFunctionPass at the MIR level). Could anyone point me to where I should start in order to write an MC level pass?

Best Regards,
Lele

The MC layer doesn’t have passes. There is a method called emitInstruction(), which is called once per instruction, one MCInst at a time.

In the past I have accomplished what you’d like by overriding the methods in ObjectStreamer to buffer all the MCInsts for a function, and then running the analysis on the buffered instructions.
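
The buffering idea can be sketched roughly as follows. This is only a self-contained illustration, not LLVM code: Inst, Streamer, BufferedFunction, and BufferingStreamer are hypothetical stand-ins for MCInst and the streamer classes, and treating every label as the start of a new function is a simplifying assumption. A real implementation would subclass the actual streamer and would need a target-specific way to find function boundaries.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an MCInst: only the mnemonic is kept here.
struct Inst {
  std::string Mnemonic;
};

// Hypothetical stand-in for a streamer interface that is called back
// once per label and once per instruction, in program order.
class Streamer {
public:
  virtual ~Streamer() = default;
  virtual void emitLabel(const std::string &Name) = 0;
  virtual void emitInstruction(const Inst &I) = 0;
};

// All instructions seen between one label and the next.
struct BufferedFunction {
  std::string Name;
  std::vector<Inst> Insts;
};

// Buffers instructions per function instead of emitting them right away,
// so that an analysis can later look at a whole function at once.
class BufferingStreamer : public Streamer {
  std::vector<BufferedFunction> Functions;

public:
  // Assumption: every label starts a new function.
  void emitLabel(const std::string &Name) override {
    Functions.push_back(BufferedFunction{Name, {}});
  }

  void emitInstruction(const Inst &I) override {
    // Instructions that appear before any label go into a placeholder.
    if (Functions.empty())
      Functions.push_back(BufferedFunction{"<unnamed>", {}});
    Functions.back().Insts.push_back(I);
  }

  std::size_t numFunctions() const { return Functions.size(); }

  // Example analysis: count how often a mnemonic occurs in one function.
  std::size_t count(const std::string &Func,
                    const std::string &Mnemonic) const {
    std::size_t N = 0;
    for (const BufferedFunction &F : Functions) {
      if (F.Name != Func)
        continue;
      for (const Inst &I : F.Insts)
        if (I.Mnemonic == Mnemonic)
          ++N;
    }
    return N;
  }
};
```

With this sketch, a caller feeds labels and instructions in order (for example emitLabel("handler") followed by emitInstruction({"syscall"})) and then queries count("handler", "syscall") once the function is complete. The point of buffering is that the analysis sees the whole function, rather than one instruction at a time.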

Here’s a link about how instructions are lowered, which might shed some light on how all this works.

https://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm

Thank you so much! That is very helpful.

Best,
Lele