[RFC][ThinLTO] llvm-dis ThinLTO summary dump format

Hey all,

Below is the proposed format for the dump of the ThinLTO module summary in the llvm-dis utility:

…/build/bin/llvm-dis t.o && cat t.o.ll

; ModuleID = ‘2.o’
source_filename = “2.ll”
target datalayout = “e-m:e-i64:64-f80:128-n8:16:32:64-S128”
target triple = “x86_64-unknown-linux-gnu”

@X = constant i32 42, section “foo”, align 4

@a = weak alias i32, i32* @X

define void @afun() {
%1 = load i32, i32* @a
ret void
}

define void @testtest() {
tail call void @boop()
ret void
}

declare void @boop()

; Module summary:
; testtest (External linkage)
; Function (2 instructions)
; Calls: boop
; X (External linkage)
; Global Variable
; afun (External linkage)
; Function (2 instructions)
; Refs:
; a
; a (Weak any linkage)
; Alias (aliasee X)

I’ve implemented the above format in the llvm-dis utility, since there currently isn’t really a way of getting ThinLTO summaries in a human-readable format.

Let me know what you think of this format, and what information you think should be added/removed.

Thanks,
Charles

Hey all,

Below is the proposed format for the dump of the ThinLTO module summary in
the llvm-dis utility:

Thanks, Charles. A few comments/suggestions below.

> ../build/bin/llvm-dis t.o && cat t.o.ll
; ModuleID = '2.o'
source_filename = "2.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@X = constant i32 42, section "foo", align 4

@a = weak alias i32, i32* @X

define void @afun() {
  %1 = load i32, i32* @a
  ret void
}

define void @testtest() {
  tail call void @boop()
  ret void
}

declare void @boop()

; Module summary:
; testtest (External linkage)
; Function (2 instructions)

I have a preference towards including the summary type (Function, Global
Variable, or Alias) on the same line as the value name. Also, it might be
better to group all functions together (adjacent) and all global variables
together, etc.

; Calls: boop

How will it look if there is more than one callee? Do you want to put each
on a separate line like for Refs below? This would also leave room for edge
annotations (like hotness when there is profile data).

Hey all,

Below is the proposed format for the dump of the ThinLTO module summary in
the llvm-dis utility:

The first observation I have is that this should be possibly be behind
a flag. While dumping IR summaries is useful, they may get relatively
big and sometimes people just want to look at the IR.

../build/bin/llvm-dis t.o && cat t.o.ll

; ModuleID = '2.o'
source_filename = "2.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@X = constant i32 42, section "foo", align 4

@a = weak alias i32, i32* @X

define void @afun() {
  %1 = load i32, i32* @a
  ret void
}

define void @testtest() {
  tail call void @boop()
  ret void
}

declare void @boop()

; Module summary:
; testtest (External linkage)
; Function (2 instructions)
; Calls: boop
; X (External linkage)
; Global Variable
; afun (External linkage)
; Function (2 instructions)
; Refs:
; a
; a (Weak any linkage)

Do you need to dump the linkage for globals? You should be able to
just look at the global in the IR and get it (or, after looking at the
global, run `llvm-nm` on it). I think it's convenient, anyway having
that here, so if Teresa/others think it's fine, we might just keep it.

Why do we need a custom dumping format for the summary? Since we already need the YAML format anyway, wouldn’t it be better to extend that to cover the entire summary?

Peter

> Hey all,
>
> Below is the proposed format for the dump of the ThinLTO module summary
in
> the llvm-dis utility:
>

The first observation I have is that this should be possibly be behind
a flag. While dumping IR summaries is useful, they may get relatively
big and sometimes people just want to look at the IR.

In general of the info should be deducible from the IR, but it seems useful
to see exactly what information is in the summary. And when
dumping/disassembling the combined index files, then the linkage types
would have been adjusted by the thin link.

I don't have an issue with putting it behind an option though.

Teresa

Why do we need a custom dumping format for the summary? Since we already
need the YAML format anyway, wouldn't it be better to extend that to cover
the entire summary?

IMO it is useful/convenient to be able to see the summary in the llvm-dis
output.
Teresa

Why do we need a custom dumping format for the summary? Since we already
need the YAML format anyway, wouldn't it be better to extend that to cover
the entire summary?

IMO it is useful/convenient to be able to see the summary in the llvm-dis
output.

There are two things:
1) the format of the summary
2) where to add the dumper
I'm not too concerned about 2 (I wouldn't add it to llvm-dis, but rather as
a subcommand to llvm-lto2, but I don't have a strong opinion about that).

So that leaves the text based summary format. We need a format that is
easily parseable so that we can test components that manipulate summaries
in isolation. That is what the YAML format is. Why add a second format that
is not parseable?

Peter

Teresa

Why do we need a custom dumping format for the summary? Since we already
need the YAML format anyway, wouldn't it be better to extend that to cover
the entire summary?

IMO it is useful/convenient to be able to see the summary in the llvm-dis
output.

There are two things:
1) the format of the summary
2) where to add the dumper
I'm not too concerned about 2 (I wouldn't add it to llvm-dis, but rather
as a subcommand to llvm-lto2, but I don't have a strong opinion about that).

So that leaves the text based summary format. We need a format that is
easily parseable so that we can test components that manipulate summaries
in isolation. That is what the YAML format is. Why add a second format that
is not parseable?

So in this case it is purely for visual inspection - I suggested that
Charles add this as LLVM assembly comments, to make it clear that we can't
currently reconstruct the summary if the LLVM assembly file is re-assembled
(we currently need to rebuild the summary from scratch). I know you have
added support for inputing the type id summaries in YAML form for testing
purposes, since they were/are not being generated from IR during the
current module summary index builder, but we have the opposite issue with
the rest of the summaries (no serialization, but can build from IR).

Teresa

Why do we need a custom dumping format for the summary? Since we
already need the YAML format anyway, wouldn't it be better to extend that
to cover the entire summary?

IMO it is useful/convenient to be able to see the summary in the
llvm-dis output.

There are two things:
1) the format of the summary
2) where to add the dumper
I'm not too concerned about 2 (I wouldn't add it to llvm-dis, but rather
as a subcommand to llvm-lto2, but I don't have a strong opinion about that).

So that leaves the text based summary format. We need a format that is
easily parseable so that we can test components that manipulate summaries
in isolation. That is what the YAML format is. Why add a second format that
is not parseable?

So in this case it is purely for visual inspection - I suggested that
Charles add this as LLVM assembly comments, to make it clear that we can't
currently reconstruct the summary if the LLVM assembly file is re-assembled
(we currently need to rebuild the summary from scratch). I know you have
added support for inputing the type id summaries in YAML form for testing
purposes, since they were/are not being generated from IR during the
current module summary index builder, but we have the opposite issue with
the rest of the summaries (no serialization, but can build from IR).

Again: we could address that issue by adding the rest of the summary to the
YAML serialization, no? I don't want to have to update the tests and the
YAML dumper and the new-format dumper when I make a change to the summary
format.

Peter

Why do we need a custom dumping format for the summary? Since we
already need the YAML format anyway, wouldn't it be better to extend that
to cover the entire summary?

IMO it is useful/convenient to be able to see the summary in the
llvm-dis output.

There are two things:
1) the format of the summary
2) where to add the dumper
I'm not too concerned about 2 (I wouldn't add it to llvm-dis, but rather
as a subcommand to llvm-lto2, but I don't have a strong opinion about that).

So that leaves the text based summary format. We need a format that is
easily parseable so that we can test components that manipulate summaries
in isolation. That is what the YAML format is. Why add a second format that
is not parseable?

So in this case it is purely for visual inspection - I suggested that
Charles add this as LLVM assembly comments, to make it clear that we can't
currently reconstruct the summary if the LLVM assembly file is re-assembled
(we currently need to rebuild the summary from scratch). I know you have
added support for inputing the type id summaries in YAML form for testing
purposes, since they were/are not being generated from IR during the
current module summary index builder, but we have the opposite issue with
the rest of the summaries (no serialization, but can build from IR).

Again: we could address that issue by adding the rest of the summary to
the YAML serialization, no? I don't want to have to update the tests and
the YAML dumper and the new-format dumper when I make a change to the
summary format.

What is the option for emitting the summary in YAML (I know it is
incomplete, but just so we can see what it currently looks like)? Where
does it get emitted?
Teresa

There is no option currently, but I've attached a quick patch that adds a
dump-summary subcommand to llvm-lto2, so you can see what it looks like.
You can use it like this:

$ ra/bin/llvm-lto2 dump-summary
ra/test/ThinLTO/X86/Output/deadstrip.ll.tmp1.bc

sum.diff (1.43 KB)

Hey all,

Below is the proposed format for the dump of the ThinLTO module summary in
the llvm-dis utility:

Our format are usually both way: you have to be able to parse them back.

> ../build/bin/llvm-dis t.o && cat t.o.ll
; ModuleID = '2.o'
source_filename = "2.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@X = constant i32 42, section "foo", align 4

@a = weak alias i32, i32* @X

define void @afun() {
  %1 = load i32, i32* @a
  ret void
}

define void @testtest() {
  tail call void @boop()
  ret void
}

declare void @boop()

; Module summary:
; testtest (External linkage)
; Function (2 instructions)
; Calls: boop
; X (External linkage)
; Global Variable
; afun (External linkage)
; Function (2 instructions)
; Refs:
; a
; a (Weak any linkage)
; Alias (aliasee X)

I've implemented the above format in the llvm-dis utility, since there
currently isn't really a way of getting ThinLTO summaries in a
human-readable format.

What about the YAML serialization?

Oh I just see that there were already a bunch of answers. I missed the thread, sorry.

So basically my intuitive approach on this is close to what I perceive Peter’s position is:

  • don’t create a new format if there is already one.
  • if you really have a good reason to create a new format, it should replace the existing one (i.e. don’t maintain two).
  • ideally we should be able to pipe the output if llvm-dis to llvm-as in a lossless manner (if our layering requires to use another tool than llvm-as, so be it).

Oh I just see that there were already a bunch of answers. I missed the
thread, sorry.

So basically my intuitive approach on this is close to what I perceive
Peter's position is:

- don't create a new format if there is already one.
- if you really have a good reason to create a new format, it should
replace the existing one (i.e. don't maintain two).
- ideally we should be able to pipe the output if llvm-dis to llvm-as in a
lossless manner (if our layering requires to use another tool than llvm-as,
so be it).

I don't disagree with these points. I was the one who suggested that
Charles implement this as a learning experience for his ThinLTO GSoC
project, because I have long wanted a simple human-readable dumper for the
summaries. I knew about the YAML support Peter added for the type test
summaries, but thought that was largely geared towards testing while those
summaries weren't automatically generated from the IR. It sounds reasonable
to have a single format, and so doing the dumping with YAML seems fine to
me. However, I would prefer to have a way to dump that into the llvm
disassembly with llvm-dis. I'd personally put a higher priority on getting
the dumping functionality out first and faster (e.g. emitting as comments
in the disassembly to indicate that it is dump only for now), and deal with
regenerating the summary from it during llvm-as later.

Teresa

That’s coherent with the part where I wrote that llvm-dis | llvm-as should not lose information :slight_smile:
So can’t we achieve this goal by appending the YAML format to the output of llvm-dis? (and teach llvm-as to recognize it?)

If it is non-trivial to adapt llvm-as, then sure: but you could also have the YAML format printed in a comment which should allow this right?

Oh I just see that there were already a bunch of answers. I missed the
thread, sorry.

So basically my intuitive approach on this is close to what I perceive
Peter's position is:

- don't create a new format if there is already one.
- if you really have a good reason to create a new format, it should
replace the existing one (i.e. don't maintain two).
- ideally we should be able to pipe the output if llvm-dis to llvm-as in
a lossless manner (if our layering requires to use another tool than
llvm-as, so be it).

I don't disagree with these points. I was the one who suggested that
Charles implement this as a learning experience for his ThinLTO GSoC
project, because I have long wanted a simple human-readable dumper for the
summaries. I knew about the YAML support Peter added for the type test
summaries, but thought that was largely geared towards testing while those
summaries weren't automatically generated from the IR. It sounds reasonable
to have a single format, and so doing the dumping with YAML seems fine to
me. However, I would prefer to have a way to dump that into the llvm
disassembly with llvm-dis.

That's coherent with the part where I wrote that `llvm-dis | llvm-as`
should not lose information :slight_smile:

It's coherent with the first part, not necessarily with piping into llvm-as
and not losing information (which would be a good follow-on, but let's get
the dumping part at least for now).

So can't we achieve this goal by appending the YAML format to the output of

llvm-dis? (and teach llvm-as to recognize it?)

I'd personally put a higher priority on getting the dumping functionality

out first and faster (e.g. emitting as comments in the disassembly to
indicate that it is dump only for now), and deal with regenerating the
summary from it during llvm-as later.

If it is non-trivial to adapt llvm-as, then sure: but you could also have
the YAML format printed in a comment which should allow this right?

Yes.

Charles - can you take a look at extending the existing YAML summary dumper
to do this? See pcc's patch on one of the emails here for how to output
that, but let's see if we can get that optionally into the llvm-dis output
as comments for now.

Looks like the existing YAML output dumps value ids, but as we discussed in
our other thread, it would be good to map those to the value names and emit
the value names instead.

Teresa

Hey Teresa,

Yep, I can do that - I’ll be clear in the output of the command that it’s just dumping for now and not meant for piping back into llvm-as. Do you want this code in llvm-dis or llvm-lto2?

Thanks,
Charles

Hey Teresa,

Yep, I can do that - I'll be clear in the output of the command that it's
just dumping for now and not meant for piping back into llvm-as. Do you
want this code in llvm-dis or llvm-lto2?

Well most of the actual code will likely live in
include/llvm/IR/ModuleSummaryIndexYAML.h. But I'd prefer an option in
llvm-dis to dump it as comments there, so there will be some small work in
llvm-dis itself. See Peter's llvm-lto2 patch he attached (which dumps to
stdout) for how to invoke the YAML dumper for the ModuleSummaryIndex -
which right now only dumps type test and whole program devirtualization
summary info.

Teresa

Ideally I'd vote for both, I'd say it should be the default output if the
bitcode contains a summary.

I know there’s been a bunch of discussion here already, but I was wondering if perhaps someone (probably Teresa? Peter?) could:

  1. summarize the current state
  2. describe the end-goal
  3. describe what steps (& how this patch relates) are planned to get to (2)

My naive thoughts, not being intimately familiar with any of this: Usually bitcode and textual IR support go in together or around the same time, and designed that way from the start (take r211920 for examaple, which added an explicit representation of COMDATs to the IR). This seems to have been an oversight in the implementation of IR summaries (is that an accurate representation/statement?) & now there’s an effort to correct that.

So it seems like that would start with a discussion of what the right end-state would be: What the syntax in textual IR should be, then implementing it. I can understand implementing such a thing in steps - it’s perhaps more involved than the COMDAT situation. In that case starting on either side seems fine - implementing the emission first (hidden behind a flag, so as not to break round-tripping in the interim) or the parsing first (no need to hide it behind any flags - manually written examples can be used as input tests).

(& it sounds like there’s some partially implemented functionality using a YAML format that was intended to address how some test cases could be written? & this might be a good basis for the syntax - but seems to me like it might be a bit disjointed/out of place in the textual IR format that’s not otherwise YAML-based?)

  • Dave