RFC: Machine Level IR text-based serialization format

Hi all,

I would like to propose a text-based, human readable format that will be used to
serialize the machine level IR. The major goal of this format is to allow LLVM
to save the machine level IR after any code generation pass and then to load
it again and continue running passes on the machine level IR. The primary use case
of this format is to make testing the code generation passes easier, by allowing
developers to write tests that load the IR, invoke just a specific code gen pass,
and then inspect the output of that pass by checking the printed IR.


The proposed format has a number of key features:
- It stores the machine level IR and the optional LLVM IR in one text file.
- The connections between the machine level IR and the LLVM IR are preserved.
- The format uses a YAML based container for most of the data structures. The LLVM
  IR is embedded in the YAML container.
- The format also uses a new, text-based syntax to serialize the machine instructions.
  The instructions are embedded in YAML.

This is an incomplete example of a YAML file containing the LLVM IR, the machine level IR
and the instructions:

Looks good. How are you planning to "assemble" the MI-level YAML description into the actual in-memory IR?

-Krzysztof

Hi Alex,

Thanks for working on this.

Personally I would rather not have to write YAML inputs but instead rely on what the machine dumps look like. That being said, I can live with YAML :).

More importantly, how do you plan to report syntax errors to the users?
Things like invalid instruction, invalid registers, etc.?
What about unallocated code, i.e., virtual registers, invalid SSA form, etc.?

Cheers,
Q.

From: "Alex L" <arphaman@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, April 28, 2015 11:56:42 AM
Subject: [LLVMdev] RFC: Machine Level IR text-based serialization
format

This is an incomplete example of a YAML file containing the LLVM IR,
the machine level IR and the instructions:

---
ir: |
  define i32 @fact(i32 %n) {
    %1 = alloca i32, align 4
    store i32 %n, i32* %1, align 4
    %2 = load i32, i32* %1, align 4
    %3 = icmp eq i32 %2, 0
    br i1 %3, label %10, label %4

  ; <label>:4 ; preds = %0
    %5 = load i32, i32* %1, align 4
    %6 = sub nsw i32 %5, 1
    %7 = call i32 @fact(i32 %6)
    %8 = load i32, i32* %1, align 4
    %9 = mul nsw i32 %7, %8
    br label %10

  ; <label>:10 ; preds = %0, %4
    %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
    ret i32 %11
  }
...
---
number: 0
name: fact
alignment: 4
regInfo:
  ....
frameInfo:
  ....
body:
  - bb: 0
    llbb: '%0'
    successors: [ 'bb#2', 'bb#1' ]
    liveIns: [ '%edi' ]
    instructions:
      - 'push64r undef %rax, %rsp, %rsp'
      - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'

Hi Alex,

I think this looks promising. What are the 1 and 4 above? How are you proposing to serialize operand flags (dead, etc.)?

-Hal

Looks good. How are you planning to "assemble" the MI-level YAML
description into the actual in-memory IR?

I plan on developing a parser for the new text format for the machine
instructions. This parser will parse instructions, operands and memory
operands, and it will run after the machine function and the embedded LLVM IR
are parsed, so that the references to the basic blocks, constant pools,
frame indices, etc. can be resolved immediately. Each string literal in a
list of instructions in a machine basic block will be parsed using this
parser, and the results will be assembled together into a list of
instructions for that basic block.

I hope that answers your question,
Alex.
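
A minimal sketch of the parsing order described above, written in Python purely for illustration (the actual parser would presumably live in LLVM's C++ code). It assumes the set of basic block numbers is already known from the surrounding YAML, so a 'bb#N' reference can be checked as soon as it is seen. The names Instr, parse_operand, parse_instruction and parse_block are hypothetical, and register operands are kept as raw strings.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    opcode: str
    operands: list = field(default_factory=list)


def parse_operand(text, known_blocks):
    """Classify one operand string; block references are resolved eagerly."""
    if text.startswith("bb#"):
        number = int(text[3:])
        if number not in known_blocks:
            raise ValueError(f"unknown basic block '{text}'")
        return ("block", number)
    if text.lstrip("-").isdigit():
        return ("imm", int(text))
    if "%" in text:
        # Leading flag keywords such as 'undef' stay attached in this sketch.
        return ("reg", text)
    raise ValueError(f"cannot parse operand '{text}'")


def parse_instruction(text, known_blocks):
    """Split 'opcode op, op, ...' into an Instr."""
    opcode, _, rest = text.strip().partition(" ")
    operands = [parse_operand(op.strip(), known_blocks)
                for op in rest.split(",") if op.strip()]
    return Instr(opcode, operands)


def parse_block(instruction_strings, known_blocks):
    """Parse each YAML string literal and assemble the block's instruction list."""
    return [parse_instruction(s, known_blocks) for s in instruction_strings]


# The block numbers would come from the already-parsed machine function.
body = parse_block(
    ["push64r undef %rax, %rsp, %rsp",
     "mov32mr %rsp, 1, %noreg, 4, %noreg, %edi"],
    known_blocks={0, 1, 2},
)
for instr in body:
    print(instr.opcode, instr.operands)
```

Because the block set is passed in, a dangling reference is rejected at the point where it is parsed rather than in a later fix-up pass.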

Partly. :)
I'm wondering what support you would need from each target. Obviously you'd need to be able to parse the mnemonics and the register names, but that's probably doable without additional target-specific support.

Also, is this going to support SSA and post-SSA/post-RA code?

-Krzysztof

Thanks,

Unfortunately, the machine dumps are quite incomplete (and tricky to parse
too!), and thus some sort of new syntax has to be developed.
I think that a YAML based container is a good candidate for this purpose,
as it has a structured format that represents things like machine functions,
frame information, register information, target specific machine function
details, etc. in a clear and readable way.

I haven't thought about error reporting that much, as I've been mostly
working on developing the syntax and making sure that all the data structures
can be represented by it. But I believe that the errors that crop up in an
invalid machine instruction syntax, like invalid basic block references,
invalid instructions, etc., can be reported quite well, and I can rely on the
already existing error reporting facilities in LLVM to help me. The more
structural errors, like missing attributes, will be handled by the YAML parser
automatically, and I might extend it to provide better/more specific error
messages. And I think that it's possible to use the machine verifier to catch
the other errors that you've mentioned.

Alex
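
To make the error reporting discussion a bit more concrete, here is a rough Python sketch of a diagnostic that carries a file name, a line, a column and the offending source line, and prints a caret under the error position. It is only an illustration of the idea, not LLVM's diagnostic API. The Diagnostic class, check_registers, the register set and the file name fact.yaml are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Diagnostic:
    filename: str
    line: int      # 1-based line of the instruction string in the file
    column: int    # 1-based column inside that string
    message: str
    source: str    # the offending text, shown for context

    def render(self):
        caret = " " * (self.column - 1) + "^"
        return (f"{self.filename}:{self.line}:{self.column}: error: "
                f"{self.message}\n{self.source}\n{caret}")


# Assumed register names, for the demo only; a real parser would ask the target.
KNOWN_REGISTERS = {"rax", "rsp", "edi", "noreg"}


def check_registers(filename, line_no, source):
    """Return a Diagnostic for the first unknown '%name', or None if all are known."""
    pos = source.find("%")
    while pos != -1:
        end = pos + 1
        while end < len(source) and (source[end].isalnum() or source[end] == "_"):
            end += 1
        name = source[pos + 1:end]
        if name not in KNOWN_REGISTERS:
            return Diagnostic(filename, line_no, pos + 1,
                              f"unknown register '%{name}'", source)
        pos = source.find("%", end)
    return None


bad = "mov32mr %rsp, 1, %noreg, 4, %noreg, %edx"
diag = check_registers("fact.yaml", 12, bad)
print(diag.render())
```

Structural problems (a missing attribute, a malformed mapping) would still be left to the YAML layer; this sketch only covers errors inside an instruction string.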

YAML is what is suggested in the FIXME for the textual Machine IR, so
that might be the motivation behind Alex's choice.

I sort of agree that it could be better to go with a "proprietary"
format based off of the dumps. This means that a dedicated Machine
IR parser could be implemented for the purposes of library users who
want to open the files. I also think that the dumps are much easier
to diff and read.

There are parts of the suggested YAML format that seem to require some
parsing anyway, like the instruction strings. If YAML is going to be used,
I think it would be better to let the instructions be encoded in YAML
instead of leaving them as a string, if that makes sense.

/ Bevin


Hi Hal,

The 1 and 4 above are constants that are specific to x86 memory addressing;
I believe they basically compute the address RSP + 1 * 0 + 4.
I haven't settled on a final version of the operand flags (for registers)
syntax, but at the moment I'm thinking of something like this:
- The IsDef flag is implied by the use of the register before the '=',
  unless it's implicit.
- TiedTo and IsEarlyClobber aren't serialized, as they are defined by
  the instruction description. (I believe that's true in all cases, but I'm
  not 100% sure.)
- IsUndef, IsImp, IsKill, IsDead, IsInternalRead, IsDebug: keywords like
  'implicit', 'undef', 'kill', 'dead' are used before the register, e.g.
  'undef %rax', 'implicit-def kill %eflags'.
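
The flag rules in the list above could be sketched roughly as follows, in Python for illustration only. The sketch assumes that explicit defs are written before '=', that flag keywords precede the register, and that every operand is a register (immediates are not handled). The helper names and the sample instruction text are hypothetical, since the syntax was not final at this point.

```python
FLAG_KEYWORDS = {"implicit", "implicit-def", "undef", "kill", "dead"}


def parse_register_operand(text, is_def=False):
    """Parse e.g. 'implicit-def kill %eflags' into (register, flags)."""
    *keywords, register = text.split()
    if not register.startswith("%"):
        raise ValueError(f"expected a register, got '{register}'")
    unknown = [k for k in keywords if k not in FLAG_KEYWORDS]
    if unknown:
        raise ValueError(f"unknown operand flag(s): {unknown}")
    flags = set(keywords)
    if is_def:
        # IsDef is implied by the operand's position before '='.
        flags.add("def")
    return register[1:], flags


def split_defs_and_uses(text):
    """Operands before '=' are explicit defs; everything after the opcode is a use
    or an implicit operand carrying its own keywords."""
    defs_text, eq, rest = text.partition("=")
    if not eq:
        defs_text, rest = "", text
    defs = [parse_register_operand(d.strip(), is_def=True)
            for d in defs_text.split(",") if d.strip()]
    opcode, _, uses_text = rest.strip().partition(" ")
    uses = [parse_register_operand(u.strip())
            for u in uses_text.split(",") if u.strip()]
    return opcode, defs, uses


# Hypothetical instruction text in the proposed style.
sample = "%eax = add32rr kill %eax, %edi, implicit-def dead %eflags"
print(split_defs_and_uses(sample))
```

TiedTo and IsEarlyClobber do not appear anywhere in the sketch, matching the idea that they come from the instruction description rather than from the text.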

I don't have a syntax for the SubReg_TargetFlags at the moment.

Alex
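
For readers unfamiliar with the x86 operand layout, the arithmetic mentioned at the start of this reply can be written out as a small Python sketch. It assumes the five memory operands are base, scale, index, displacement and segment, in that order, that %noreg contributes zero, and that the segment operand can be ignored. The function name and the register values are made up.

```python
def effective_address(base, scale, index, disp, regs):
    """Compute base + scale * index + disp; 'noreg' counts as zero."""
    def value(reg):
        return 0 if reg == "noreg" else regs[reg]
    return value(base) + scale * value(index) + disp


# 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi' stores %edi at RSP + 1 * 0 + 4.
regs = {"rsp": 0x7FFF0000}  # arbitrary stack pointer value for the demo
addr = effective_address("rsp", 1, "noreg", 4, regs)
print(hex(addr))
```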

Partly. :)
I'm wondering what support you would need from each target. Obviously
you'd need to be able to parse the mnemonics and the register names, but
that's probably doable without additional target-specific support.

Yes, the mnemonics and register names should be fairly straightforward.

However, there are several target specific data structures that a machine
function might have, like the MachineFunctionInfo.
The MachineFunctionInfo is particularly interesting, as it can be difficult
to serialize on certain targets, like Mips, XCore and Hexagon,
but luckily the other targets have a pretty simple subclass of
MachineFunctionInfo. I think that each target's MachineFunctionInfo
and other similar classes would have to be extended to contain the
intrusive methods for serialization.

The instructions themselves don't have much target specific data, but
they do have a couple of things. There are also some target specific things
that don't even need to be serialized, like the MipsCallEntry, which can be
used in a MachineMemOperand but doesn't contain any data when LLVM is
compiled in release mode.

Alex.

Also, is this going to support SSA and post-SSA/post-RA code?

Yes, it's going to support both SSA and post SSA code.

Initially I was thinking about developing a text-based format that's not
based on YAML, but is closer in spirit to the LLVM IR. However, I found
that a structured format like YAML lends itself quite well to the machine
level IR. At the same time the instructions themselves don't work that well
with YAML, thus I decided on this hybrid approach. Therefore I don't
think that instructions should be in YAML, as they would just get too
verbose.

I understand that a non YAML format has its own advantages and may be
preferred by the majority. If the community decides that another format is
better, I would be happy to work on that.

Alex.
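
The verbosity argument can be illustrated with a short Python sketch that puts the compact string form next to one possible fully structured encoding of the same instruction. The structured key names ('opcode', 'operands', 'reg', 'imm') are invented for the comparison and are not part of the proposal; JSON is used only because it ships with Python and reads much like YAML's flow style.

```python
import json

compact = "mov32mr %rsp, 1, %noreg, 4, %noreg, %edi"

# Hypothetical structured encoding of the same instruction.
structured = {
    "opcode": "mov32mr",
    "operands": [
        {"reg": "rsp"},
        {"imm": 1},
        {"reg": "noreg"},
        {"imm": 4},
        {"reg": "noreg"},
        {"reg": "edi"},
    ],
}


def to_compact(instr):
    """Render the structured form back into the one-line string syntax."""
    parts = [f"%{op['reg']}" if "reg" in op else str(op["imm"])
             for op in instr["operands"]]
    return f"{instr['opcode']} " + ", ".join(parts)


rendered = json.dumps(structured, indent=2)
print(to_compact(structured))
print(len(compact.splitlines()), "line versus", len(rendered.splitlines()), "lines")
```

Both forms carry the same information here; the difference is only in how much of the file a single instruction occupies.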

I think the YAML hybrid makes a sort of sense.

The instructions need to be dense in order to keep the format readable.

Machine functions and basic blocks have loads of side table data
(MachineModuleInfo, MachineFunctionInfo, the target-specific info, etc)
that needs to be serialized in order to write interesting tests. Therefore,
it's better to use an easily extensible format for them.

Since the instruction format is partially based on the machine dump format,
you could use something similar to that, like '%reg:subreg'.

On a tangential note, IIRC the machine dumps store the virtual register
information (register class, mainly) in-band at the end of the instruction.
Based on the format you described, I'm assuming this is what would be stored
out-of-band in 'regInfo'.

/ Bevin

I love the idea of having some sort of textual representation. My only concern is that our YAML parser is not very actively maintained (is there someone expert with its implementation and active in the project?) and (IMHO) over-engineered when compared to the simplicity of our custom IR parser.

Without TLC, I’m afraid it would make for a poor piece of LLVM infrastructure to rely on. The reliability of the serialization mechanism is very important if we are to have any chance of applying fuzz testing to the backend pieces; after all, testability is a huge motivation for this work.

As a concrete example, a file solely containing ‘%’ crashes the yaml parser:

$ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root is NULL iff parsing failed"' failed.
0 yaml2obj 0x000000000048682e
1 yaml2obj 0x0000000000486b43
2 yaml2obj 0x000000000048570e
3 libpthread.so.0 0x00007f5e79643340
4 libc.so.6 0x00007f5e78c9acc9 gsignal + 57
5 libc.so.6 0x00007f5e78c9e0d8 abort + 328
6 libc.so.6 0x00007f5e78c93b86
7 libc.so.6 0x00007f5e78c93c32
8 yaml2obj 0x000000000045f378
9 yaml2obj 0x000000000040d4b3
10 yaml2obj 0x000000000040b0fa
11 yaml2obj 0x0000000000404a79
12 yaml2obj 0x0000000000404dd8
13 libc.so.6 0x00007f5e78c85ec5 __libc_start_main + 245
14 yaml2obj 0x0000000000404879
Stack dump:
0. Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml

To get this out first: I'd love to have a way to serialize machine-IR! I often spend a lot of time trying to create .ll files in a way that the machine-IR still looks a certain way when it finally hits the relevant passes in codegen. It would be so much easier to just specify the machine IR immediately before the pass I'm interested in.

For that use case it is worth keeping the following things in mind:
- Please try to keep the output of the various dump functions, esp. MachineInstr::dump(), MachineOperand::dump(), MachineBasicBlock::dump() as close as possible to the format you use for serializing. It would be unnecessarily confusing if the dump()s I see while debugging differed from what I can read in a text file. Having said that, you don't necessarily have to change your serialization format to be like the dump() functions; you may just as well adjust the dump() functions. Just avoid them being different without reason. I can also imagine that the serialization shows a bit less information in cases where the information is obvious in a serialization context but not when dump()ing a piece in isolation.
- Design the format in a way that makes it easy for humans to create it. If the only way to produce these files reliably is by dumping existing machine-ir I will have a hard time designing minimal and easy to understand testcases. By that I mean mostly the possibility to leave out information that can be inferred or guessed, so the resulting test is compact and shows what it is about. Just looking at your example below there is a lot of information that is redundant or which could be filled in by sensible defaults: the function "number", the basic block number, predecessors and successors of a basic block, maybe allowing to leave out the llvm IR (though that probably is not allowed by CodeGen at the moment).

- Matthias

This is a valid concern. I agree with you that it is somewhat
over-engineered, and its traits based API that tries to hide the differences
between the input and output can become a bit of a hindrance when serializing
certain complicated data structures in MC IR, as they have to be converted
to and from structures that the API can process. That said, I think that it
does work rather well for the majority of the data structures and attributes.

It also has been pretty reliable for me so far, but it probably does need
some improvement. I've been working on certain missing features in LLVM's
YAML support and I can definitely look into making the YAML API more stable
as well.

I think Nick Kledzik implemented most of the YAML stuff btw.

Ideally the new syntax would replace the existing print/dump syntax. The
new syntax will omit certain information when it can be inferred (e.g. the
TiedTo and IsEarlyClobber attributes for register operands that I mentioned
earlier in this thread), so maybe we could have some sort of verbose dumping
option where absolutely everything is dumped.
The syntax does try to be kind of similar to the current format, but at the
same time it tries to be more parser and human friendly as well.

I agree, one of my goals is to try to make it minimal and leave out things
where it makes sense to do so.
I plan on making a lot of the YAML attributes optional, so that the user
won't necessarily have to specify them, and the parser will set the
attributes to some predetermined default values or will try to infer them.
I will present my plans for those optional attributes when I send out the
patches that serialize the specific data structures, so feel free to check
them out.

The LLVM IR is optional by the way, but a lot of passes will probably crash
if you don't include it ;)
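
One way to picture the optional attributes is a loader that fills in defaults for anything the test author leaves out. The Python sketch below is an assumption-laden illustration, not the planned implementation: which keys are optional, the default values (function number 0, alignment 4, block numbers taken from position) and the names BasicBlock, MachineFunction and load_function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    number: int
    instructions: list
    successors: list
    live_ins: list


@dataclass
class MachineFunction:
    name: str
    number: int
    alignment: int
    body: list


def load_function(mapping):
    """Build a MachineFunction from a parsed YAML mapping, defaulting omitted keys."""
    blocks = []
    for position, raw in enumerate(mapping.get("body", [])):
        blocks.append(BasicBlock(
            number=raw.get("bb", position),        # inferred from order if omitted
            instructions=raw.get("instructions", []),
            successors=raw.get("successors", []),
            live_ins=raw.get("liveIns", []),
        ))
    return MachineFunction(
        name=mapping["name"],                      # the only required key here
        number=mapping.get("number", 0),
        alignment=mapping.get("alignment", 4),
        body=blocks,
    )


# A hand-written test input that leaves out everything it can.
minimal = {
    "name": "fact",
    "body": [
        {"instructions": ["push64r undef %rax, %rsp, %rsp"]},
        {"instructions": []},
    ],
}
mf = load_function(minimal)
print(mf.number, mf.alignment, [bb.number for bb in mf.body])
```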

Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
with syntactically invalid or unusual YAML.

Also, you're thinking of YAMLIO which is a layer on top of the YAML parser
(YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for
some types of data, not for all) but still use the YAML parser.

-- Sean Silva

Maybe. I don’t see why we would want to lock ourselves out of using afl-fuzz though.

As an aside, you haven't mentioned it, but will the IR parser be rewritten
at all? Is the YAML a container on top of the IR?

If you are rewriting the IR parser, would it be possible to maintain
some sort of grammar?