Proposal: extended MDString syntax

Hello,

I propose a new syntax for metadata strings (MDStrings) which would allow string data to be spread over multiple lines:

!0 = metadata !{metadata !"""
hello
world
"""}

The special three-quote sequence marks the beginning and end of a multi-line string.
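To make the delimiter rule concrete, here is a minimal Python sketch that pulls the contents of such strings out of a chunk of textual IR. It is not the code from the attached patch; the helper name `extract_multiline_mdstrings` is made up for illustration, and it assumes that everything between the opening and closing three-quote sequences (including the newline right after the opening one) belongs to the string, which the proposal does not spell out.

```python
import re

# Hypothetical helper, not taken from the patch: match !""" ... """ spans.
# DOTALL lets "." cross line boundaries; the non-greedy ".*?" stops at the
# first closing three-quote sequence.
TRIPLE_QUOTED = re.compile(r'!"""(.*?)"""', re.DOTALL)


def extract_multiline_mdstrings(ir_text):
    # Assumption: the string content is the raw text between the delimiters.
    return [match.group(1) for match in TRIPLE_QUOTED.finditer(ir_text)]


example = '!0 = metadata !{metadata !"""\nhello\nworld\n"""}\n'
strings = extract_multiline_mdstrings(example)
print(repr(strings[0]))
```

A real lexer would also have to decide how escapes and an embedded three-quote sequence are handled; this sketch ignores both.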

This syntax could have a variety of uses, but of particular interest is that it could be used as a basis for serializing MachineFunctions. MachineFunctions are not an entirely self-contained IR; they contain references to LLVM IR. As such, a serialization of a MachineFunction needs to have an LLVM Module to refer to. By encoding MachineFunctions in metadata, they can easily accompany an LLVM Module. By being multi-line, the data could be made to be reasonably human-readable and human-editable.

Attached is a simple proof-of-concept patch which implements parsing and printing for this new MDString syntax.

Comments welcome!

Dan

extended-mdstring-syntax.patch (5.12 KB)

Hi Dan,

I am not against adding multi-line support for metadata. I am okay with your syntax, and also with other solutions, such as escaping \n’s or adopting C-style concatenation of strings across multiple lines. But I think that serialization of MachineFunctions should not be done in metadata. I understand the problem that you pointed out, that machine instructions refer to LLVM IR, but I don’t think that metadata is the right container for them.

Nadav

Can you suggest an alternative solution? Can you describe why you don't
think metadata is the right container? This alone isn't really helpful at
moving us toward something that there has been widespread agreement LLVM
needs.

Hi Chandler,

Sure, we can talk about serializing MF. But the discussion should focus on serializing MF, and not multi-line metadata support, which is only one of the possible solutions. I understand the problem that Dan mentioned (that MF references IR), and I am sure that there are other problems that he did not mention. I would be happy to hear more about other solutions that you considered and other problems that you ran into. Have you considered using a new format that embeds LLVM IR?

Thanks,
Nadav

(Note, this is the first I've heard of this plan and just figured it out myself)

So inverting it so that MI contains LLVM IR instead of the other way
around? Then we'd need a serialization format for MI that happened to
include a way of serializing LLVM IR within. From a quick "hey, this
seems reasonable" the idea of embedding the MI into the IR rather than
the other way around seems to make sense since we already have
code to serialize the IR.

The only other idea I've seen was an intern project that really didn't
go very far a few years ago of using *AML (one of them, I can't recall
which). I think Bob had some idea of finishing the project, but I'm
not sure where it's going.

Do you have any other ideas or some ideas as to why you'd prefer one
direction rather than the other?

-eric

> (Note, this is the first I’ve heard of this plan and just figured it out myself)

Yes, this is also the first time I heard about this and I haven’t had a chance to think about this problem too deeply.

> So inverting it so that MI contains LLVM IR instead of the other way
> around? Then we’d need a serialization format for MI that happened to
> include a way of serializing LLVM IR within. From a quick “hey, this
> seems reasonable” the idea of embedding the MI into the IR rather than
> the other way around seems to make sense since we already have
> code to serialize the IR.
>
> The only other idea I’ve seen was an intern project that really didn’t
> go very far a few years ago of using *AML (one of them, I can’t recall
> which). I think Bob had some idea of finishing the project, but I’m
> not sure where it’s going.
>
> Do you have any other ideas or some ideas as to why you’d prefer one
> direction rather than the other?
>
> -eric

I think that the two alternatives that are obvious are for the MF to contain the IR, or for the IR to contain the MF. Alternatively, they can live in parallel and the MF may reference the IR. I am not sure what is the right approach here, but my gut feeling is that metadata is not necessarily the right container for MF.

Bin Zeng worked on a project as an intern last summer to serialize machine functions to yaml. At the time, we were unable to commit it to trunk because we were waiting for Nick's yamlio work to get committed. I've still got his patches and plan to commit them whenever I get a chance. I was also considering having another intern pick up that project where it left off.

The approach is perhaps similar to what Dan is proposing, just flipped around. In one scheme, the top-level container is yaml and the IR is embedded within it along with the machine function stuff. In the other, the IR is the top-level container and the machine functions are embedded as metadata. I prefer the yaml approach.

I'd be glad to reprioritize contributing the rest of Bin's patches to make those available sooner rather than later. The more interesting part, with either scheme, is how to represent the machine functions. We definitely want something that is readable but still easy to parse.

> I think that the two alternatives that are obvious are for the MF to contain
> the IR, or for the IR to contain the MF. Alternatively, they can live in
> parallel and the MF may reference the IR. I am not sure what is the right
> approach here, but my gut feeling is that metadata is not necessarily the
> right container for MF.

Off the cuff I'd think that IR containing MF seems most reasonable and
the use of metadata to contain it seems to be good from two
perspectives I think:

a) it already exists,
b) oddly enough that we could get rid of the metadata and still have a
valid module/compilation unit seems like it might be interestingly
useful, but I'm not sure what uses there are off the top of my head.

That said, I really have no preference either way, just idle
speculation. Probably similar to you since we've both not thought
deeply upon it :)

The MDString stuff does seem like it might be useful in general if
we'd like to have that though.

-eric

> Bin Zeng worked on a project as an intern last summer to serialize machine functions to yaml. At the time, we were unable to commit it to trunk because we were waiting for Nick's yamlio work to get committed. I've still got his patches and plan to commit them whenever I get a chance. I was also considering having another intern pick up that project where it left off.
>
> The approach is perhaps similar to what Dan is proposing, just flipped around. In one scheme, the top-level container is yaml and the IR is embedded within it along with the machine function stuff. In the other, the IR is the top-level container and the machine functions are embedded as metadata. I prefer the yaml approach.

Any reason? I remember the project, of course, but didn't really have
a good feel on any of the design decisions other than "hey, there's
this yaml thing". That said I don't believe I was in on the design
discussion in the first place.

> I'd be glad to reprioritize contributing the rest of Bin's patches to make those available sooner rather than later. The more interesting part, with either scheme, is how to represent the machine functions. We definitely want something that is readable but still easy to parse.

At least posting them with some description and a design for how it
works and the tradeoffs could be goodness. Then they'd be out there to
look at and discuss.

-eric

The basic yaml support already exists, too. I just haven't committed the patches.

I'll try to sort out some of Bin's patches soon. They might need some updating, since yamlio has evolved since he worked with it.

> Off the cuff I'd think that IR containing MF seems most reasonable and
> the use of metadata to contain it seems to be good from two
> perspectives I think:
>
> a) it already exists,
> b) oddly enough that we could get rid of the metadata and still have a
> valid module/compilation unit seems like it might be interestingly
> useful, but I'm not sure what uses there are off the top of my head.

I'll give the reason why I like this having just thought about it a while:

I think of this as a pre-lowered hint. IE, take some IR, and give a hint to
the code generator to lower like this over here. I see a few benefits of
this model:

- It makes it reasonably easy to only specify the MI for the bit you really
are trying to test. You can let the normal lowering process handle any
other bits. I think this will help keep test cases small and reasonable.

- It makes it easy to re-baseline when the code generator changes but the
changes are acceptable -- strip metadata and run it through the existing
pipeline.

- It has the potential to be "incomplete" or of varying degrees of
completeness which I think will be useful in testing different layers of
the system... but Dan probably has more/better thoughts on this front than
I do.

The one thing I don't really like about the reversed model of MI containing
IR is that now the MI model has to be "complete", so we have to invent what
that means. I'm not really interested in this outside of generating test
cases, so anything that simplifies the space of what we have to design
*really* appeals to me.

I don’t have a strong opinion either way.

I don’t understand your comment about the MI model needing to be “complete”. The yaml approach was not “MI containing IR”. In fact, the initial implementation doesn’t have good support for serializing machine instructions, but it works great for IR-level passes run by llc, e.g., codegenprepare. The yaml file is just a way of collecting the various kinds of information needed for that, and you can omit the machine instructions entirely if you want to serialize after an IR-level pass. I think all of the benefits you mention for using metadata could apply just as well to using yaml; it’s just a matter of how you stuff the data into a file.

Some other things to keep in mind:

  • There are a number of different data structures that will need to be serialized to really make this work. Besides the IR and the MachineInstructions, there are various data structures in MachineFunctions, some of which are target-specific. Yaml works well for that because it provides a nicely structured way of organizing that data. The same could be done with metadata, though.

  • One idea that Bin implemented last summer was to stash the last pass in the yaml. Unlike IR-level passes, llc has more constraints on the order in which it runs passes. We decided to just accept that limitation and assume a fixed order for the passes. We added the -stop-after option to specify where in the pass sequence to stop and serialize the code out to a yaml file. By including the name of the -stop-after pass in the yaml output, we automatically know where to start up again when processing a yaml input. There are some cases where passes are run more than once, and I don’t think we had a good solution for handling that.
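The stop/resume idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bin's implementation: `PASS_ORDER` and `passes_to_run` are hypothetical names, and apart from codegenprepare the pass names are placeholders rather than llc's real pipeline.

```python
# Hypothetical fixed pass order. Only "codegenprepare" is named in this
# thread; the other entries are placeholders for illustration.
PASS_ORDER = ["codegenprepare", "isel", "regalloc", "prologepilog"]


def passes_to_run(stopped_after=None, stop_after=None):
    """Return the slice of PASS_ORDER to execute.

    stopped_after: pass name recorded in the serialized input, or None
                   when starting from plain IR.
    stop_after:    pass name given on the command line, or None to run
                   the rest of the pipeline.
    """
    start = 0 if stopped_after is None else PASS_ORDER.index(stopped_after) + 1
    end = len(PASS_ORDER) if stop_after is None else PASS_ORDER.index(stop_after) + 1
    return PASS_ORDER[start:end]


print(passes_to_run(stop_after="codegenprepare"))
print(passes_to_run(stopped_after="codegenprepare"))
```

Because the lookup is by name, a pass that appears twice in the order would be ambiguous, which is the unresolved case mentioned above.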

I’m curious to find out if you have ideas for how to serialize the actual machine instructions. That’s where it really gets interesting, IMO.

One question I have about this, what is the use case that is being targeted here?

Micah

There are a variety of potential uses, but at a minimum, we would like to be able to run individual code-gen passes for debugging and unit testing, just like we do for IR-level passes.

I’d suggest something based on YAML which would allow you to include IR verbatim just by indenting it.

The IR module should be optional when serializing MI. The back-pointers from MI to IR are not required, and I can imagine many useful test cases that won’t need them.

module: |
  define void @linkit(i8* %source) #0 {
  entry:
    %.b243 = load i1* @Pflag, align 1
    %cond = select i1 %.b243, i32 (i8*, %struct.stat.6.13.20.64*)* @lstat, i32 (i8*, %struct.stat.6.13.20.64*)* @stat
    %call = call signext i32 %cond(i8* %source, %struct.stat.6.13.20.64* undef) #2
    ret void
  }
  @Pflag = external unnamed_addr global i1
  declare signext i32 @lstat(i8* nocapture, %struct.stat.6.13.20.64* nocapture) #1
  declare signext i32 @stat(i8* nocapture, %struct.stat.6.13.20.64* nocapture) #1

mi: |
  BB#0: derived from LLVM BB %entry
      Live Ins: %I0
  %O6<def> = SAVEri %O6, -176
  %I1<def> = SETHIi <ga:@Pflag>[TF=3]
  %I1<def> = ADDri %I1<kill>, <ga:@Pflag>[TF=4]
  %I1<def> = SLLXri %I1<kill>, 12
  %I2<def> = LDUBri %I1<kill>, <ga:@Pflag>[TF=5]; mem:LD1[@Pflag]
  %I1<def> = SETHIi <ga:@stat>[TF=3]
  %I1<def> = ADDri %I1<kill>, <ga:@stat>[TF=4]
  %I1<def> = SLLXri %I1<kill>, 12
  %I1<def> = ADDri %I1<kill>, <ga:@stat>[TF=5]
  %I3<def> = SETHIi <ga:@lstat>[TF=3]
  %I3<def> = ADDri %I3<kill>, <ga:@lstat>[TF=4]
  %I3<def> = SLLXri %I3<kill>, 12
  %I3<def> = ADDri %I3<kill>, <ga:@lstat>[TF=5]
  CMPri %I2<kill>, 0, %ICC<imp-def>
  %I1<def,tied2> = MOVXCCrr %I3<kill>, %I1<kill,tied0>, 9, %ICC<imp-use,kill>
  JMPLrr %I1<kill>, %G0, %O0<kill>, %O1<undef>, %O0<imp-def,dead>, %O1<imp-def,dead>, %ICC<imp-def,dead>, %O6<imp-use>, ...
  %O0<def> = ORrr %G0, %I0<kill>
  RET 8
  %G0<def> = RESTORErr %G0, %G0

We could also use more YAML structure to represent MI functions and basic blocks, if needed.
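As a rough illustration of how little machinery the container needs, the following Python sketch splits a file of this shape into its top-level sections. It is not a YAML parser and not part of any existing patch: the hypothetical `split_sections` helper understands only top-level `key: |` literal blocks and strips the common indentation from each body.

```python
def split_sections(text):
    """Split 'key: |' literal blocks into a dict of de-indented strings."""
    raw = {}
    key = None
    for line in text.splitlines():
        # A new section starts at column 0 and ends with the ": |" marker.
        if line and not line[0].isspace() and line.rstrip().endswith(": |"):
            key = line.rstrip()[:-3]
            raw[key] = []
        elif key is not None:
            raw[key].append(line)

    sections = {}
    for name, lines in raw.items():
        # Drop trailing blank lines that merely separate sections.
        while lines and not lines[-1].strip():
            lines.pop()
        indents = [len(l) - len(l.lstrip()) for l in lines if l.strip()]
        indent = min(indents) if indents else 0
        sections[name] = "\n".join(l[indent:] for l in lines)
    return sections


doc = (
    "module: |\n"
    "  define void @f() {\n"
    "  entry:\n"
    "    ret void\n"
    "  }\n"
    "\n"
    "mi: |\n"
    "  BB#0: derived from LLVM BB %entry\n"
    "  RET 8\n"
)
parts = split_sections(doc)
print(parts["module"])
```

Since each section is just a dictionary entry, a file without a `module` section is handled the same way (`parts.get("module")` would be `None`), which matches the suggestion that the IR module be optional.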

Thanks,
/jakob

> So inverting it so that MI contains LLVM IR instead of the other way
> around? Then we'd need a serialization format for MI that happened to
> include a way of serializing LLVM IR within. From a quick "hey, this
> seems reasonable" the idea of embedding the MI into the IR rather than
> the other way around seems to make sense since we already have
> code to serialize the IR.

> I’d suggest something based on YAML which would allow you to include IR
> verbatim just by indenting it.

We can also use YAML embedded inside IR, potentially using the string
syntax Dan proposed or any other number of embedding mechanisms.

I like using YAML to represent the somewhat arbitrary datastructures of MI
so that we don't spend a lot of time inventing clever syntax for something
that has much more limited uses than the actual IR. I haven't heard anyone
really object to it.

However, I do think it's an open question as to whether to embed IR in an MI
container, or MI in an IR container. A few observations:

- No one has pointed out any really fundamental *problems* with any of the
approaches. I think both approaches can be made to work with reasonable
amounts of effort, and neither has really fundamental design problems.

- Different use cases will be more or less easy to write in different
forms. For example, Jakob's point:

> The IR module should be optional when serializing MI. The back-pointers
> from MI to IR are not required, and I can imagine many useful test cases
> that won’t need them.

I've heard Dan and others say exactly the opposite -- that MI should be
optional. I suspect that some test cases are more MI focused, and some are
less. But I don't see either being optional as a hard prerequisite.

So, here is my concrete suggestion: if all of these approaches seem to work
and there aren't huge downsides but only reasonable tradeoffs, let the
folks writing the patches make the decision. At the moment that appears to
be Dan and maybe Bob. Is there a reason to not let them pick the design
they want to make forward progress with and run with it? I think that will
be much more productive and get us back to the important part: testing
MI-level passes.

Back-pointers from MI to LLVM IR are a hack that gets the job done, but they are not good IR design. We are already seeing the usefulness of memory operands crumble because of the stack coloring pass. Throw in something like modulo scheduling, and they will be completely wrong for alias analysis.

MI should be allowed to evolve into a proper self-contained IR that doesn’t depend on LLVM IR.

I don’t want to canonicalize this hack by encoding it in the file format we use for our tests. A container format that holds LLVM IR and MI as sibling top-level entities is much easier to gradually change towards a standalone MI IR.

Thanks,
/jakob

I would have to agree with Jakob's point here. I have to work on the MI IR, and I'm not guaranteed to have corresponding LLVM IR. Currently I have to hack up LLVM to get this working. I would much rather have LLVM itself support an MI IR without needing to have LLVM IR around.

Micah

This is an interesting point. I tend to think of CodeGen as being an
analysis of LLVM IR, and while it can diverge somewhat, the ways in which
it diverges are usually constrained in some ways, and leveraging
information already available in LLVM IR was practical. However, in a world
where CodeGen is doing things like restructuring loops, this seems less
practical.

Bob, I look forward to seeing the patches you have.

Thanks,

Dan

> MI should be allowed to evolve into a proper self-contained IR that doesn’t depend on LLVM IR.

> This is an interesting point. I tend to think of CodeGen as being an analysis of LLVM IR, and while it can diverge somewhat, the ways in which it diverges are usually constrained in some ways, and leveraging information already available in LLVM IR was practical.

That has traditionally been the case.

> However, in a world where CodeGen is doing things like restructuring loops, this seems less practical.

I think that passes like LSR and vectorization really should be MI passes because they are so closely coupled with the subtarget. I could imagine some sort of symbiosis with the instruction selector as well.

This would require a version of MI with generic opcodes, a lot like a SelectionDAG looks before isel. (And an MI SCEV).

Thanks,
/jakob