[OT] Control Flow Graph(CFG) into Abstract Syntax Tree(AST)

Hi,

I know most compilers go from AST to CFG.
I am writing a decompiler, so I was wondering if anyone knew of any
documents describing how best to get from CFG to AST.
The decompiler project is open source.
https://github.com/jcdutton/libbeauty

The decompiler already contains a disassembler and a virtual machine
resulting in an annotated CFG. It uses information gained from using a
virtual machine to annotate the CFG. Another use of the VM will be to
help analyze self modifying code.

The decompiler can output C source code form the CFG, but it is not
easy to understand the result due to lack of structure such as for {}
loops.
I wish to create an AST from the CFG in order to be able to output for
{}, while {} and if...then...else structure.
The CFG to AST step seems a little complicated, so I was looking for
some pointers to how best to do it to save me re-inventing the wheel.

Kind Regards

James

Hello,

Yours is a very interesting problem. I am also interested in this, hope someone can answer.

The general process that you are trying to do is known as "structuring control flow graphs." The basic algorithm, as far as I know it, amounts to three steps (in C terms):
1. Make switch statements
2. Make loops
3. Figure out the if/else stuff

Google finds this article: <http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=914984> for a more detailed algorithm.

Hi,

Simon Moll (in CC) has written a decompiler for LLVM in his Bachelor's Thesis here at Saarland University. The thesis is titled "Decompilation of LLVM IR" and can be found here:
http://www.cdl.uni-saarland.de/publications/

The library he implemented is called "Axtor" (for "AST Extractor") and has been used primarily to generate OpenCL code from LLVM. In this context, it successfully compiles OpenCL -> LLVM -> OpenCL "cycles".
Simon will be able to give you more information, also in terms of license details which I don't remember.

Cheers,
Ralf

I have read the pdf doc.
It covers some things well, but is missing an important piece of information.
It only needs to cover well formed CFG. I.e. Normal loops and if...then...else.
It does not need to handle "goto".
Specifically:
Section 4.1.2
1) Figure 4.2: Ir-reducible control-flow.
2) Figure 4.3: A CFG with loop-crossing branches.
3) Figure 4.4: Ab-normal selection paths.

If the original program was written in C, these types of CFG are all possible.
The only way to represent these in AST is to introduce "gotos" at
strategic points in the CFG.
I am looking at various CFG graphs, and from looking at them, it is
quite obvious to me which branches should be gotos, and which should
be loops or if...then...else. The problem I have is coming up with an
algorithm that would make the same decisions as I would.

Interesting things. I once also wanted to get structural information
from LLVM binary code. And I found it just consisted of connected
basic blocks, CFG, not ast-like tree.

What I am wondering is that, why LLVM does not include some high level
IR, before converting to RISC-like instruction. I think different work
and anlysis should be done on different level. As far as I know,
open64 compiler has several different level IR, called WHIRL.

Could anyone answer this question?

Hi James,

my thesis explicitly describes how to get rid of unstructured
control-flow by means of CFG transformations. In essence the whole
procedure consists in three steps:
1.) make the CFG reducible (Controlled Node Splitting)
2.) Make all loops single exit such that they can be modeled with
BREAK-statements (in nested loops as well).
3.) Restructure abnormal selection paths (The original thesis uses
node splitting. However, the current version relies on predication
which does not duplicate blocks).

After the transformation, all control-flow can be represented with
infinite loops, continue, break and simple if..else.. statements (w/o
short-circuit evaluation).

The transformation does its job quite well and so there isn't even an
AST-primitive for GOTOs in Axtor (the original targets (OpenCL, GLSL)
don't support it).

Anyway, extending the framework to support GOTOs should be
straight-forward: instead of transforming the CFG axtor would then
just drop a GOTO-node whenever it encounters unstructured
control-flow.

I hope i could clarify what Axtor can and can't do.

Regards,
Simon