RFC: Machine Level IR text-based serialization format

There’s no reason to rewrite the IR parser.

-eric

I think that the new syntax is less readable than the current format of the "dump" functions, and in the long term it would be better to have something more human-friendly. However, using YAML has the advantage that it's easier to parse it than the direct output of "dump" and so it will take less time to implement a YAML-based solution. My concern is that you may run out of time to complete this and the file format is not the most important thing in this project. Getting it to work, if only as a proof of concept, would be very helpful to everyone. Coming up with a fancier grammar and implementing a parser for it could be done later on top of the initial implementation.

-Krzysztof

Until I got to this email, I was opposed to using YAML here -- I'd
prefer a custom grammar and parser -- but I find Krzysztof's point
here pretty convincing.

Starting with a (hybrid) YAML representation seems like a reasonable
way to bootstrap a machine IR. Once it's in place and working, we
can come back and strip away the YAML parts until it's human-
friendly. (And since YAML is machine-friendly, upgrade scripts for
testcases should be straightforward.)

BTW, we probably need some sort of LangRef document for this. Maybe
docs/MIRLangRef.rst?

I love the idea of having some sort of textual representation. My only concern is that our YAML parser is not very actively maintained (is there someone expert with its implementation *and* active in the project?) and (IMHO) over-engineered when compared to the simplicity of our custom IR parser.

Without TLC, I'm afraid it would make for a poor piece of LLVM infrastructure to rely on. The reliability of the serialization mechanism is very important if we are to have any chance of applying fuzz testing to the backend pieces; after all, testability is a huge motivation for this work.

As a concrete example, a file solely containing '%' crashes the yaml parser:
$ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root is NULL iff parsing failed"' failed.
0 yaml2obj 0x000000000048682e
1 yaml2obj 0x0000000000486b43
2 yaml2obj 0x000000000048570e
3 libpthread.so.0 0x00007f5e79643340
4 libc.so.6 0x00007f5e78c9acc9 gsignal + 57
5 libc.so.6 0x00007f5e78c9e0d8 abort + 328
6 libc.so.6 0x00007f5e78c93b86
7 libc.so.6 0x00007f5e78c93c32
8 yaml2obj 0x000000000045f378
9 yaml2obj 0x000000000040d4b3
10 yaml2obj 0x000000000040b0fa
11 yaml2obj 0x0000000000404a79
12 yaml2obj 0x0000000000404dd8
13 libc.so.6 0x00007f5e78c85ec5 __libc_start_main + 245
14 yaml2obj 0x0000000000404879
Stack dump:
0. Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml

Hopefully a fuzzer that is fuzzing a yaml input would not waste its time with syntactically invalid or unusual YAML.

Maybe. I don't see why we would want to lock ourselves out of using afl-fuzz though.

I don't think we're locked out of anything. We should fix bugs in the
YAML parser as we find them.

Agreed. FWIW, we use the YAML layer quite a lot, and we’ll definitely maintain it as needed.

>> 2015-04-28 16:26 GMT-07:00 Matthias Braun <matze@braunis.de>:
>>
>> For that use case it is worth keeping the following things in mind:
>> - Please try to keep the output of the various dump functions, esp.
>> MachineInstr::dump(), MachineOperand::dump(),
>> MachineBasicBlock::dump() as close as possible to the format you use
>> for serializing.
>> [...]
>>
>> Ideally the new syntax would replace the existing print/dump syntax. The
>> new syntax will omit certain information when this information can be
>> inferred (e.g. the TiedTo and IsEarlyClobber attributes for register
>> operands that I mentioned earlier in this thread), so maybe we could
>> have some sort of verbose dumping option where absolutely everything
>> is dumped.

> Until I got to this email, I was opposed to using YAML here -- I'd
> prefer a custom grammar and parser -- but I find Krzysztof's point
> here pretty convincing.
>
> Starting with a (hybrid) YAML representation seems like a reasonable
> way to bootstrap a machine IR. Once it's in place and working, we
> can come back and strip away the YAML parts until it's
> human-friendly. (And since YAML is machine-friendly, upgrade scripts
> for testcases should be straightforward.)

I think that this would be a good approach.
I will work on the proposed YAML hybrid format for now and will begin
sending out the patches soon. Once it's working, people can evaluate it
for themselves and see if it suits them or if we need to change it to a
custom format.

> BTW, we probably need some sort of LangRef document for this. Maybe
> docs/MIRLangRef.rst?

That's fine with me.

Alex

What is missing in the current textual format that doesn't allow going
all the way to machine code?

Is the reason for this project because the current .LL format can't
always be put to bitcode?

> What is missing in the current textual format that doesn't allow going
> all the way to machine code?

Nothing.

What's missing is the ability to serialize the machine level itself.
Since many passes have to run to get from .ll to .s, it's currently
hard (impossible?) to test individual machine level passes robustly.
Having a way to serialize machine IR will let us test each pass in
isolation.

> Is the reason for this project because the current .LL format can't
> always be put to bitcode?

Nope, .ll and .bc can represent the same things.

Thank you.

So do you expect clients of LLVM to still continue to supply .ll files
to llvm-as?

Or will this new format be the new representation?

The LLVM IR is still serialized as .ll and .bc. This new format is just
for better testing of the backend(s).

(For clarity, `llvm-as` is a developer tool; it shouldn't be part of a
production workflow. Production tools should use the C++ API.)

Hi all,

I would like to propose a text-based, human readable format that will be used to
serialize the machine level IR. The major goal of this format is to allow LLVM
to save the machine level IR after any code generation pass and then to load
it again and continue running passes on the machine level IR. The primary use case
of this format is to make the code generation passes easier to test, by letting
developers write tests that load the IR, invoke just one specific code gen pass,
and then inspect the output of that pass by checking the printed IR.

The proposed format has a number of key features:
- It stores the machine level IR and the optional LLVM IR in one text file.
- The connections between the machine level IR and the LLVM IR are preserved.
- The format uses a YAML based container for most of the data structures. The LLVM
  IR is embedded in the YAML container.
- The format also uses a new, text-based syntax to serialize the machine instructions.
  The instructions are embedded in YAML.
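
To make the last point concrete, here is a rough Python sketch of how such an instruction string could be split into its parts and printed back. It is not the proposed parser: the grammar (optional definitions before ` = `, then an opcode, then comma-separated operands whose leading words are flags such as `undef`) is only inferred from the two instruction strings used in this proposal's example, and the names `parse_instruction` and `print_instruction` are hypothetical.

```python
def parse_instruction(text):
    """Parse '[defs =] opcode operand, operand, ...' into a dict.

    Assumed grammar, inferred from two example strings only.
    """
    defs = []
    rest = text.strip()
    if " = " in rest:
        # Everything left of ' = ' is a comma-separated list of defined registers.
        lhs, rest = rest.split(" = ", 1)
        defs = [d.strip() for d in lhs.split(",")]
    opcode, _, operand_text = rest.partition(" ")
    operands = []
    for raw in operand_text.split(","):
        words = raw.split()
        if not words:
            continue
        # The last word is the operand; any words before it are treated as flags.
        operands.append({"flags": words[:-1], "value": words[-1]})
    return {"defs": defs, "opcode": opcode, "operands": operands}


def print_instruction(inst):
    """Print a parsed instruction back in the same textual form."""
    ops = ", ".join(" ".join(op["flags"] + [op["value"]]) for op in inst["operands"])
    prefix = ", ".join(inst["defs"]) + " = " if inst["defs"] else ""
    return (prefix + inst["opcode"] + " " + ops).rstrip()


# The two instruction strings are copied from this proposal's example.
load = parse_instruction("%edi = mov32rm %rsp, 1, %noreg, 4, %noreg")
push = parse_instruction("push64r undef %rax, %rsp, %rsp")
```

Printing a parsed instruction and getting the original string back is the small-scale version of the save-then-load goal stated above; a real implementation would need a complete grammar and error reporting.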

This is an incomplete example of a YAML file containing the LLVM IR, the machine level IR
and the instructions:

---
ir: |
  define i32 @fact(i32 %n) {
    %1 = alloca i32, align 4
    store i32 %n, i32* %1, align 4
    %2 = load i32, i32* %1, align 4
    %3 = icmp eq i32 %2, 0
    br i1 %3, label %10, label %4

  ; <label>:4 ; preds = %0
    %5 = load i32, i32* %1, align 4
    %6 = sub nsw i32 %5, 1
    %7 = call i32 @fact(i32 %6)
    %8 = load i32, i32* %1, align 4
    %9 = mul nsw i32 %7, %8
    br label %10

  ; <label>:10 ; preds = %0, %4
    %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
    ret i32 %11
  }

...
---
number: 0
name: fact
alignment: 4
regInfo:
  ....
frameInfo:
  ....
body:
  - bb: 0
    llbb: '%0'
    successors: [ 'bb#2', 'bb#1' ]
    liveIns: [ '%edi' ]
    instructions:
      - 'push64r undef %rax, %rsp, %rsp'
      - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
      - ....
        ....
  - bb: 1
    llbb: '%4'
    successors: [ 'bb#2' ]
    instructions:
      - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
      - ....
        ....
  - ....
    ....
...

The example above shows a YAML file with two YAML documents (delimited by `---`
and `...`) containing the LLVM IR and the machine function information for the function `fact`.
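
As an illustration of the two-document layout, the following Python sketch separates such a stream into its documents using only the `---` and `...` marker lines. It is a simplification, not a YAML parser: it assumes markers appear alone on a line at column 0, the function name `split_yaml_documents` is hypothetical, and the sample stream is a shortened stand-in for the example above.

```python
def split_yaml_documents(text):
    """Split a YAML stream into documents at '---' and '...' marker lines.

    Simplification: a marker is recognised only when it is the whole line.
    """
    documents = []
    current = None  # None means we are not inside a document
    for line in text.splitlines():
        if line == "---":
            # A start marker also closes any document that is still open.
            if current is not None:
                documents.append("\n".join(current))
            current = []
        elif line == "...":
            if current is not None:
                documents.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        documents.append("\n".join(current))
    return documents


# Shortened stand-in for the example above (the IR body is illustrative only).
stream = """\
---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 %n
  }
...
---
number: 0
name: fact
...
"""

docs = split_yaml_documents(stream)
```

Here `docs[0]` holds the embedded LLVM IR document and `docs[1]` the machine function document. Because the IR sits in an indented block scalar, its lines cannot be mistaken for markers under the column-0 assumption.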

When a specific format is chosen, I'll start with patches that serialize the
embedded LLVM IR. Then I'll add support for things like machine functions and
machine basic blocks, and I think that an intrusive implementation will work best
for data structures like these. After that I will continue adding support for
serialization of the remaining data structures.

Thanks for reading through the proposal. What are your thoughts about this format?

I’m really looking forward to this; it will be extremely useful for testing the debug info backend.
For debug nodes referenced via DBG_VALUE intrinsics, it looks like they could just point to the corresponding nodes in the optional IR.
Are there any plans to represent metadata such as the DebugLoc(ations) attached to the machine instructions?

-- adrian

Yes, the debug location that's attached to the machine instruction will be
serialized as well. I will describe how when I send out the patch that
serializes it.

Alex.