RFC: Extending optimization reporting

I would like to begin a discussion about updating LLVM’s opt-report infrastructure. There are some things I’d like to be able to do with optimization reports that I don’t think can be done, or at least aren’t natural to do, with the current implementation.

I understand that there is a lot of code in place already to produce optimization remarks, and one of my explicit goals is to minimize the amount of updating existing code while still enabling the new features I would like to support. I have some ideas in mind for how to achieve what I’m proposing, but I want to start out by just describing the desired results. If we can reach a consensus that these are nice things to have, then we can move on to talking about the best way to get there.

I think the extensions I have in mind can broadly be organized into two categories: (1) ways to support different sorts of output, and (2) ways to create connections between different events represented in the report.

Near as I can tell, the only support we have in the code base is for YAML output. I think I could implement a new RemarkStreamer to get other formats, but nothing in the LLVM code base does that. Is that correct?

I’d like to be able to:

  • Embed some subset of optimization remarks as annotations in the generated assembly output

  • Embed the remarks in the generated executable in a binary format consumed by the Intel Advisor tool

  • Produce text output in a format recognized by Microsoft Visual Studio or other IDEs

The last of these is probably straightforward since it’s basically a streaming format such as the current infrastructure expects. The other two seem like they might be more complicated, since they involve keeping the information around, potentially across LTO, and correlated with the evolving IR until the final machine code or assembly is produced.

This leads me back to my second category of extensions, creating connections between different events in the opt report. My goal here is to be able to produce some kind of coherent report after compilation is complete that lets the user make some sense of how the IR evolved over the course of compilation and what effects that may have had on optimizations. This mostly has to do with the handling of loops, vectorization, and inlining.

Let’s say, for example, I’ve got code like this:

for (…)
    A
    if (lic)
        B
    C

And the loop-unswitch pass turns it into this:

if (lic)
    for (…)
        A; B; C
else
    for (…)
        A; C

Now let’s say the vectorizer for some reason is able to vectorize the loop in the else-clause but not the if-clause. (I don’t know if this kind of thing is possible with the current phase ordering, but I think this theoretical example illustrates the idea anyway.)

I want some way to produce a report that tells the user about the existence of the two loops that were created when we unswitched the loop so that we can then tell the user in some sensible way that we couldn’t vectorize one loop but that we could vectorize the second.

I’m not sure what the opt-viewer would currently do with a case like this, but what I want to avoid is getting stuck where the report we can emit essentially conveys the following not very helpful information.

for (…)
  // Loop was unswitched.
  // Loop could not be vectorized because…
  // Loop was vectorized.
    A
    if (lic)
        B
    C

Instead I’d like to have a way to produce something like this:

for (…)
  // Loop was unswitched for condition (srcloc)
  // Unswitched loop version #1
  // Unswitched for IF condition (srcloc)
  // Loop was not vectorized:
  // Unswitched loop version #2
  // Loop was vectorized

The primary thing missing, I think, is a way for the vectorizer to give some indication of which version of the loop it is talking about in its optimization remarks and maybe a way for the opt-viewer to be able to make sense of that.

Likewise, there are things I want to be able to track with inlining. Let’s say we go through the inlining pass pre-LTO and we make some decisions and report them. Then during LTO we go through another round of inlining and possibly make different decisions. I’d like to be able to either produce a report that shows just the inlining from the LTO pass or produce a report that shows a composite of all the inlining decisions that we made.

We tried something like this with an inlining report before (D19397, “Initial patch for inlining report”), but it had the misfortune of being proposed at about the same time that the current opt-viewer mechanism was being developed, and we didn’t manage to get aligned with that. I’m hoping that we can correct that now.
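
The two inlining views described above (LTO-only versus composite) can be pictured as two queries over one stream of inlining remarks. The following Python sketch is illustrative only: the `phase` tag, the `InlineRemark` record, and both helper functions are hypothetical names, not part of LLVM's existing remark format, and keying decisions by caller and callee ignores the case of several call sites to the same function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InlineRemark:
    phase: str      # hypothetical tag: "pre-lto" or "lto"
    caller: str
    callee: str
    inlined: bool

def phase_report(remarks, phase):
    """Return only the decisions made during one phase."""
    return [r for r in remarks if r.phase == phase]

def composite_report(remarks):
    """Merge all phases; a later decision for the same call edge wins.

    Simplification: a call edge is identified by (caller, callee) only.
    """
    latest = {}
    for r in remarks:  # remarks are assumed to be in emission order
        latest[(r.caller, r.callee)] = r
    return list(latest.values())

remarks = [
    InlineRemark("pre-lto", "main", "helper", False),
    InlineRemark("pre-lto", "helper", "leaf", True),
    InlineRemark("lto", "main", "helper", True),
]
```

With this data, `phase_report(remarks, "lto")` yields only the LTO decision for `main` calling `helper`, while `composite_report(remarks)` yields one entry per call edge with the LTO decision replacing the earlier one.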

Hi Andrew,

I think the extensions I have in mind can broadly be organized into two categories: (1) ways to support different sorts of output, and

As you may have seen from Francis’ recent work, we have some plans in this area as well. Francis is going to cover that part ...

(2) ways to create connections between different events represented in the report.

… let me cover this part.

I think the general problem is that we’re currently missing the ability to distinguish between code versions generated for the same source code, for example as functions get inlined multiple times or as loops get versioned. (There is also loop unrolling, where we duplicate iterations and potentially remove the loop altogether.)

I think that for the inlined case we can walk the inlinedAt metadata to identify the code version in question. For loops we could attach a version number to the loop id. Then as long as the versioning transformation properly declares the new version, subsequent transformations can refer to the code version they modify. Then on the client side we should have all the information to reconstruct what happened.
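
To make the inlinedAt idea concrete, here is a small Python model of a location chain. `Loc` and `code_version_key` are made-up names for illustration; in LLVM the corresponding information is the inlinedAt link carried by debug locations.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Loc:
    function: str
    line: int
    inlined_at: Optional["Loc"] = None  # call site this code was inlined into

def code_version_key(loc):
    """Walk the inlined-at chain, innermost location first."""
    key = []
    while loc is not None:
        key.append((loc.function, loc.line))
        loc = loc.inlined_at
    return tuple(key)

# The same source line of callee() inlined at two different call sites in main().
site_a = Loc("main", 10)
site_b = Loc("main", 20)
copy_a = Loc("callee", 3, inlined_at=site_a)
copy_b = Loc("callee", 3, inlined_at=site_b)
```

The two copies share a source line but produce different keys, which is the distinction a remark would need to carry to say which inlined copy it is talking about.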

I see this as essentially extending the DebugLoc field with code version information in the remark.

This still wouldn’t handle unrolling, which would require some new metadata, but it would be a good start.

What do you think?

Adam

Hi Andrew,

Near as I can tell, the only support we have in the code base is for YAML output. I think I could implement a new RemarkStreamer to get other formats, but nothing in the LLVM code base does that. Is that correct?

Yes, for now only a YAML output is supported.

The current design is the following:

The passes create a remark diagnostic and call (Machine)OptimizationRemarkEmitter::Emit. That goes through LLVMContext where the RemarkStreamer is used to handle remark diagnostics. Then in the RemarkStreamer we serialize each diagnostic to YAML through the YAMLTraits and immediately write that to the file.
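
The flow described above can be modelled in a few lines. This is a toy Python sketch, not LLVM code: `Remark`, `YAMLStreamer`, and `Emitter` are stand-ins for the real C++ classes, and the YAML layout is a simplified rendering of what LLVM writes (no escaping of quotes, no DebugLoc). The point it illustrates is that each remark is serialized and written as soon as it is emitted, so nothing accumulates in memory.

```python
import io
from dataclasses import dataclass, field

@dataclass
class Remark:
    kind: str                 # "Passed", "Missed" or "Analysis"
    pass_name: str
    name: str
    function: str
    args: list = field(default_factory=list)  # ordered (key, value) pairs

class YAMLStreamer:
    """Stand-in for the remark streamer: serialize and write immediately."""
    def __init__(self, out):
        self.out = out

    def emit(self, r):
        lines = [f"--- !{r.kind}",
                 f"Pass: {r.pass_name}",
                 f"Name: {r.name}",
                 f"Function: {r.function}",
                 "Args:"]
        lines += [f"  - {key}: '{value}'" for key, value in r.args]
        lines.append("...")
        self.out.write("\n".join(lines) + "\n")

class Emitter:
    """Stand-in for the remark emitter that passes call into."""
    def __init__(self, streamer):
        self.streamer = streamer

    def emit(self, remark):
        if self.streamer is not None:   # remarks disabled: do no work
            self.streamer.emit(remark)

out = io.StringIO()
ore = Emitter(YAMLStreamer(out))
ore.emit(Remark("Missed", "inline", "NoDefinition", "foo",
                [("Callee", "bar"), ("String", " will not be inlined into "),
                 ("Caller", "foo")]))
```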

One of the main ideas from the beginning of the optimization remarks is to do as little work as possible on the compiler side. We don’t want to keep remarks in memory or significantly increase compile-time because of this. Most of the work is expected to be done on the client side, with, if possible, help from LLVM libraries.

Then on the other side, I recently added a parsing infrastructure for remarks in lib/Remarks, which parses YAML using the YAMLParser, performs some semantic checks on the remarks and creates a list of remarks::Remark. This does not re-use any code from the generation side for the following reasons:

* The generation is based on LLVM diagnostics, which has its own class hierarchy.
* The diagnostics are deeply coupled with LLVM IR / MIR, and we don’t want to generate dummy IR just for parsing a bunch of remarks and displaying them in an HTML view (e.g. opt-viewer.py).
* The YAML generated by LLVM can’t be parsed using the YAMLTraits, because we have an unknown number of arguments that can have the same key. We use the YAML parser for this, like tools/llvm-opt-report was doing before I added the remark parser in-tree.
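
The repeated-key point in the last bullet is easy to see with plain data. In the Python sketch below, the argument list mirrors the shape of a remark's `Args` entries (an ordered sequence of single-key entries); the helper name `message` is an assumption for illustration, and the dictionary is only an analogy for any fixed key-to-value mapping.

```python
# Arguments as they appear in a remark: an ordered sequence in which the
# key "String" occurs more than once.
args = [
    ("Callee", "bar"),
    ("String", " will not be inlined into "),
    ("Caller", "foo"),
    ("String", " because its definition is unavailable"),
]

def message(args):
    """Rebuild the human-readable message by concatenating values in order."""
    return "".join(value for _, value in args)

# Forcing the arguments into a fixed key -> value mapping loses information:
# the second "String" overwrites the first.
as_mapping = dict(args)
```

Keeping the arguments as an ordered list preserves both the order and the repeated keys, which is what a consumer needs in order to rebuild the full message.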

One main issue right now is that we don’t have a way to serialize a remarks::Remark to YAML, and if we can somehow manage to use the same abstraction we use for parsing when we’re generating remarks, that would solve a lot of issues. The main reason I haven’t looked more deeply into this is that it would require making extra copies and extra allocations (especially of the arguments) that we would like to avoid doing during generation.

I'd like to be able to:
- Embed some subset of optimization remarks as annotations in the generated assembly output

This would require keeping remarks in memory until we reach the asm-printer. Another way to do this is to pipe the output of clang to another tool that adds these annotations based on debug info and the remark file.
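
A post-processing tool of the kind described here could look roughly like the following. This is a sketch under several assumptions: the assembly carries `.loc <file-number> <line> <column>` directives, `#` starts a comment for the target assembler, all code comes from a single source file, and the remarks have already been reduced to a mapping from source line to messages. `annotate` is a hypothetical helper, not an existing LLVM tool.

```python
import re

# Matches ".loc <file-number> <line> ..."; the file number is ignored here
# because the sketch assumes a single source file.
LOC = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)")

def annotate(asm_lines, remarks_by_line):
    """Insert comments after the first .loc directive for each remarked line."""
    seen = set()
    out = []
    for text in asm_lines:
        out.append(text)
        m = LOC.match(text)
        if m:
            line = int(m.group(2))
            if line in remarks_by_line and line not in seen:
                seen.add(line)
                out += [f"\t# remark: {msg}" for msg in remarks_by_line[line]]
    return out

asm = ["\t.loc\t1 3 12", "\tcallq\tbar", "\t.loc\t1 4 3", "\tretq"]
annotated = annotate(asm, {3: ["bar will not be inlined into foo"]})
```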

- Embed the remarks in the generated executable in a binary format consumed by the Intel Advisor tool

I recently added -mllvm -remarks-section. See https://llvm.org/docs/CodeGenerator.html#emitting-remark-diagnostics-in-the-object-file for what it contains.

The model we’re planning on using on Darwin is through dsymutil, which will merge all the remark files while processing the debug info, and create a separate file in the final .dSYM bundle with all the remarks.

- Produce text output in a format recognized by Microsoft Visual Studio or other IDEs

This would be very nice! I think right now there is no easy (or clean) way to add a new format, but we should definitely work on making that easier.

I have a few patches coming with a “binary” format that we want to use, so maybe we can work on building an infrastructure that can serve YAML, the binary format, and leave room for any new formats.

I tried to make the C API on the parsing side easy to use with any other format. See llvm-c/Remarks.h.

Let me know what you think!

Thanks,

Hi Adam,

Thanks for your input.

If I understand correctly, you’re saying that we can handle the loop versioning issue by explicitly identifying new loops as they are created. So, the unswitching optimization, for example, would report that it unswitched loop-0 at source location X, creating loop-1 and loop-2, and then later the vectorizer would report that it was unable to vectorize loop-1 at source location X, and later still that it was able to vectorize loop-2 at source location X. And at each stage the optimization pass emitting the remark would get the version information from loop metadata. Is that right?
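
Under that reading, a client could rebuild a nested report from a flat remark stream. The Python sketch below assumes each remark carries a loop version id and that the versioning pass lists the ids it creates; the field names (`loop`, `creates`) and the `render` helper are hypothetical, since no such fields exist in the current remark format.

```python
remarks = [
    {"pass": "loop-unswitch", "loop": 0, "msg": "Loop was unswitched",
     "creates": [1, 2]},
    {"pass": "loop-vectorize", "loop": 1, "msg": "Loop was not vectorized"},
    {"pass": "loop-vectorize", "loop": 2, "msg": "Loop was vectorized"},
]

def render(remarks, loop=0, depth=0):
    """List remarks for one loop version, then recurse into versions it created."""
    lines = []
    indent = "  " * depth
    for r in remarks:
        if r["loop"] != loop:
            continue
        lines.append(f"{indent}// {r['msg']}")
        for child in r.get("creates", []):
            lines.append(f"{indent}  // Loop version #{child}")
            lines += render(remarks, child, depth + 2)
    return lines

report = render(remarks)
```

Here `report` lists the unswitching remark first, followed by each loop version with its own vectorization result indented beneath it.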

If so, I think that is in basic agreement with what I wanted to do. I’ll need to talk about it with some of my co-workers who have been thinking about ways to solve this problem, but I like this direction.

As you say, unrolled loops are still a problem. There’s basically no place to reliably hang metadata so that we can report that an optimization is working with an unrolled portion of a loop.

I’m not sure about the inlining issue. At first glance it does sound as simple as you suggest. However, my colleagues who have done a lot of work with inlining tell me there are some complications that make recovering all of the necessary data to form a coherent description of what has happened difficult. They’ve explained it to me a couple of times, but I haven’t internalized the subtleties yet. I’ll have to get back to you about this part.

-Andy

Thanks, Francis. I actually wasn’t up to date on your latest work. It sounds like you’ve laid some helpful groundwork.

I think generalizing the remark handling interface should be fairly manageable, and that’s probably a good place for me to start getting involved.

My understanding of the very high-level design is that a pass creates an optimization remark object, passes it to an optimization remark emitter, which passes it to the diagnostic handler, which passes it to a remark streamer. All of these stages appear to have pretty clean interfaces. I believe there is even a mechanism to plug in a different remark streamer, though maybe not all of the wiring to connect that to a command line option.

I expect there will be glitches that will surface along the way, but supporting something like an IDE consumable text output format seems like it should be as simple as plugging in a new remark streamer that can consume the existing optimization remark objects and produce the desired text format. But that’s at the hand-waving/QED level, right? Since you’ve started looking into supporting other formats and say there’s some infrastructure work to be done, I guess it’s not quite that easy. In any event, I’d be happy to work with you on generalizing the infrastructure.
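
As a rough illustration of what such a text format might be, the Python sketch below renders a remark in the `file(line,column): category: message` shape that Visual Studio-style tools commonly parse. The exact category token an IDE would accept for remarks is an assumption here, as are the dictionary layout and the `to_ide_line` helper.

```python
def to_ide_line(remark, category="warning"):
    """Format one remark as 'file(line,col): category: [pass] message'."""
    loc = remark["loc"]
    return (f"{loc['file']}({loc['line']},{loc['column']}): "
            f"{category}: [{remark['pass']}] {remark['message']}")

line = to_ide_line({
    "pass": "loop-vectorize",
    "loc": {"file": "foo.c", "line": 3, "column": 12},
    "message": "loop not vectorized",
})
```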

For some of the more involved use cases I want to cover, I was thinking the biggest challenge might be in adding extra information to the remark objects to support the new formats but doing so in a way that doesn’t break the YAML streamer or require any significant changes to it.

Also, in terms of design philosophy, I completely agree with your goals of trying to minimize the compile time and memory footprint of the optimization reporting mechanism, but I think that if we want to support something that requires more memory or bigger IR it should be OK to take that hit on an opt-in basis. Do you agree?

Thanks,

Andy

Also, in terms of design philosophy, I completely agree with your goals of trying to minimize the compile time and memory footprint of the optimization reporting mechanism, but I think that if we want to support something that requires more memory or bigger IR it should be OK to take that hit on an opt-in basis. Do you agree?

I agree.

Also, I’ll add that keeping track of the different loop versions is an important idea. Regarding unrolling, I believe that we have a mechanism for encoding information on this into DWARF discriminators (see http://llvm.org/viewvc/llvm-project?rev=349973&view=rev). It would be good to have a solution here for loop versioning; we want this for both remarks and PGO.

-Hal

Hi Adam,

Thanks for your input.

If I understand correctly, you’re saying that we can handle the loop versioning issue by explicitly identifying new loops as they are created. So, the unswitching optimization, for example, would report that it unswitched loop-0 at source location X, creating loop-1 and loop-2, and then later the vectorizer would report that it was unable to vectorize loop-1 at source location X, and later still that it was able to vectorize loop-2 at source location X. And at each stage the optimization pass emitting the remark would get the version information from loop metadata. Is that right?

Exactly!

I’m not sure about the inlining issue. At first glance it does sound as simple as you suggest. However, my colleagues who have done a lot of work with inlining tell me there are some complications that make recovering all of the necessary data to form a coherent description of what has happened difficult. They’ve explained it to me a couple of times, but I haven’t internalized the subtleties yet. I’ll have to get back to you about this part.

Please keep me posted about this.

Thanks,
Adam

Hi Adam,

I don’t have much to report here, but I wanted to let you know that I haven’t forgotten about this completely.

I talked to Robert Cox about the inlining part of things, and he agreed that at a high level the problem we want to solve is essentially like the loop versioning problem in that we mostly need some kind of ID and a place to hang it. He was a little nervous about depending on debug information because he’s seen some things in the past where that changed how things were optimized. I guess with the way debug info is implemented in LLVM there’s a more or less continuous spectrum from just keep a couple of pieces of metadata to track source location through to full DWARF support, so any attempt to solve this with metadata that isn’t really debug info would be pointless and redundant. Speaking for myself I’m also a little concerned about the tendency of debug info to get dropped during optimization, but I guess that doesn’t happen as regularly as it used to?

I hope to be able to do more with this soon.

-Andy

Hi Andy,

“[Robert Cox] was a little nervous about depending on debug information because he’s seen some things in the past where that changed how things were optimized.”

Any such case should be considered a bug, and reported. There used to be some really egregious cases; lately we’ve found more isolated instances. There’s only one situation I’m aware of that we let ride: generating call-frame information introduces scheduling barriers. Note that the source-location information has never (AFAIK) affected optimization; it’s the more invasive value-tracking stuff that tends to be the problem.

“I guess with the way debug info is implemented in LLVM there’s a more or less continuous spectrum from just keep a couple of pieces of metadata to track source location through to full DWARF support,”

There is a source-location-only mode intended to support optimization remarks, i.e. be able to associate a source location with a remark that you want to report. Offhand I don’t recall whether that mode supports tracking inlined scopes, but it probably could if it doesn’t now. One of my team recently fixed a bug with updating the source location on loops in a function being inlined, so we should have that distinction available now. I don’t think there’s anything in place that distinguishes copies of an unrolled loop though.

“Speaking for myself I’m also a little concerned about the tendency of debug info to get dropped during optimization, but I guess that doesn’t happen as regularly as it used to?”

Dropping location info is frequently a bug, although it’s not as hard-and-fast a rule as the one about not affecting optimization (sometimes there just isn’t a reasonable location to use; DWARF does not support associating two source locations with the same instruction). I know my team tries to fix these whenever we find them, because we are very interested in improving the quality of the debugging experience on optimized code. If you run into cases with losing/poor-quality optimization remarks, likely there is a poor debugging experience to match, so a bug report (if not a patch) would benefit everyone.

HTH,

–paulr