RFC: Extending optimization reporting

Hi everyone,

We've done some work at Sony PlayStation on making remarks more useful for our
end-users (not compiler devs), and I think there's some alignment in our end
goals, though we've taken a slightly different approach so far. We've recently
built a Visual Studio extension to display remarks next to source lines, and to
improve processing time and readability we've focused on post-processing the
remark YAML. A binary format could dramatically improve processing times, while
more information and better-integrated remarks would let us present richer data
to our users.

That said, a binary format wouldn't fundamentally change the nature of the
post-processing we have to do. Due to inlining, to display all the remarks that
correlate to a source line, we have to examine all of the available remarks.
For context, I've included a description of the main steps here:

1. Scan and extract remark data from all of the YAML. We condense arguments
to two types: reference args (anything with a DebugLoc) and string args
(anything else). Sequential runs of string args are combined, so "with cost=XX
(threshold=YYY)" becomes one arg. (A rough sketch of this in-memory model
follows the list.)

2. Remarks that target the same source object and describe the same action
(e.g., ten "ClobberedBy" remarks on the same line) are merged; at this point we
chain the arguments together rather than deduplicating the text. After this
step the text is discarded and we work with our own in-memory format.

3. Once all the remarks have been parsed, we group them by the source files
they live in. However, this is a best-effort attempt: some remarks don't
have debug locations, or don't have enough of a filename to uniquely identify
the source file.

4. Lastly, we de-duplicate the text in each arg slot of the remark. Repeated
string args get condensed, while multiple references (or, e.g., different
cost/threshold text) are preserved. These remarks are packed into a by-line,
by-source-file database and written to disk.
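
To make the in-memory side of steps 1 and 2 a bit more concrete, here's a
rough sketch of the kind of representation this produces; the names and exact
fields are illustrative rather than our actual code:

#include <cstdint>
#include <vector>

struct DebugLocRef
{
  uint32_t FileIndex;   // index into the string cache described below
  uint32_t Line;
  uint32_t Column;
};

struct RemarkArg
{
  enum class ArgKind { Reference, String };
  ArgKind Kind;         // Reference: has a DebugLoc; String: everything else
  uint32_t TextIndex;   // interned text (a run of string args collapses to one)
  DebugLocRef Loc;      // only meaningful when Kind == ArgKind::Reference
};

// Remarks describing the same action on the same source object (step 2) are
// merged by chaining their argument lists; the text is deduplicated in step 4.
struct MergedRemark
{
  uint32_t RemarkNameIndex;
  DebugLocRef Loc;
  std::vector<RemarkArg> Args;
};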

To save on space, all strings are cached and referenced by index. We also
demangle names up front to make them human readable. Some of our large projects
generate about 3GB of YAML; with this process, the final remark database is
roughly 500MB. This takes about 30 seconds in our (somewhat unoptimized) remark
compiler, or about 400,000 remarks/sec.
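
The string caching is just ordinary interning. A minimal sketch, assuming a
std::unordered_map-backed table (ours isn't necessarily implemented this way):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Every distinct string is stored once and referred to by a 32-bit index
// everywhere else in the database.
class StringTable
{
public:
  uint32_t intern(const std::string &S)
  {
    auto It = Indices.find(S);
    if (It != Indices.end())
      return It->second;
    uint32_t Index = static_cast<uint32_t>(Strings.size());
    Strings.push_back(S);
    Indices.emplace(S, Index);
    return Index;
  }

  const std::string &get(uint32_t Index) const { return Strings[Index]; }

private:
  std::vector<std::string> Strings;
  std::unordered_map<std::string, uint32_t> Indices;
};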

A lot of that time is spent simply scrubbing through the YAML to perform
lexical analysis. LLVM's remark format is actually a subset of YAML that can be
understood with only a stateful lexer, but a binary format that saved on
stepping over characters would be a huge win. On the simple end, a format that
was a list of:

struct FlatString
{
  uint32_t size;   // uint32_t from <cstdint>
  char data[];     // C99 flexible array member
};

would speed up parsing remarks immensely, but there's obviously room for
improvement there. Clear hierarchical data, header information, and maybe
hashes for strings (or even just for string literals in the LLVM source, whose
hashes could be computed at LLVM compile time) would also be big wins for us.
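
To illustrate why even that strawman helps, a reader for a buffer of such
length-prefixed strings never has to look at individual characters. A minimal
sketch, for the hypothetical format above rather than any existing LLVM one:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Walk a buffer of [u32 size][size bytes] records. Unlike a YAML lexer, each
// string is skipped in O(1) rather than character by character.
std::vector<std::string_view> readFlatStrings(const char *Buf, size_t Len)
{
  std::vector<std::string_view> Out;
  size_t Pos = 0;
  while (Pos + sizeof(uint32_t) <= Len) {
    uint32_t Size;
    std::memcpy(&Size, Buf + Pos, sizeof(Size));   // avoid unaligned loads
    Pos += sizeof(Size);
    if (Pos + Size > Len)
      break;                                       // truncated record
    Out.emplace_back(Buf + Pos, Size);
    Pos += Size;
  }
  return Out;
}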

To be clear, I don't believe trying to include this kind of postprocessing
phase in the LLVM toolchain is a good idea. The information required to build
our final remark database isn't available until link time or after, which
raises a lot of issues for an implementation within the toolchain. I also think
there's some separation of concerns here--the toolchain is interested in
outputting complete and correct data, but in the extension we cut corners to
improve readability and understandability (for example, we discard some almost-
duplicated cost/threshold information). The fundamental goals are different.

With regards to improving the content and coherency of remarks, there are a
lot of possible improvements, but the biggest win for us would be improving
remark-to-source matching. Some build systems we work with do confusing things
with intermediate files: depending on where clang is invoked and where the
files are moved to, it can be tricky to find the source file a remark's
DebugLoc refers to. With some files, all we see in the DebugLoc's file field is
the name of the file with no path information, which is unhelpful. One of our
big projects has at least five files named "Game.cpp", and several of them have
only that name in their DebugLoc. If nothing else, an extra "identifying
remark" emitted at the top of a .opt.yaml file containing the compiler
invocation and the absolute path to the source file would be extremely helpful
for divining the location of files, but using absolute paths in all DebugLocs
would let us match remarks to source perfectly.
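
To show why the bare filenames hurt, the matching we can do today is roughly
this shape (an illustrative sketch, not our actual code):

#include <filesystem>
#include <string>
#include <vector>

// Given the file string from a remark's DebugLoc, find the project source
// files it could refer to. A bare name like "Game.cpp" can match several
// candidates, and nothing in the remark disambiguates them.
std::vector<std::filesystem::path>
candidateSources(const std::string &DebugLocFile,
                 const std::vector<std::filesystem::path> &ProjectFiles)
{
  std::filesystem::path Ref(DebugLocFile);
  std::vector<std::filesystem::path> Matches;
  for (const auto &File : ProjectFiles) {
    if (Ref.is_absolute()) {
      if (File == Ref)
        Matches.push_back(File);     // exact match: unambiguous
    } else if (File.filename() == Ref.filename()) {
      Matches.push_back(File);       // name-only match: possibly ambiguous
    }
  }
  return Matches;
}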

> This leads me back to my second category of extensions, creating connections
> between different events in the opt report.

In addition to your example, we'd also like to see improvements to DebugLocs
for remarks in inlined code. In our VS extension, we turn references to source
objects into hyperlinks: anything in a remark that has a DebugLoc can be used
to jump to its declaration (as given by the DebugLoc), which we've found to be
really useful. However, one of the issues we've found with this is that some
categories of remark don't have very good information about where they come
from. For example, remarks in inlined functions don't contain information about
the function they're actually in, or the location of the function they were
inlined into. To be clear, the caller's location is never referenced and the
callee's name isn't included--I find myself asking "what function is this code
from?" when I'm looking at the remark. That said, having spoken to members of
our team closer to LLVM, this would be difficult because that information isn't
propagated far enough through the pipeline.
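
For illustration, the extra information we'd want a remark in inlined code to
carry is roughly the following (hypothetical fields, reusing the DebugLocRef
shape from the earlier sketch):

#include <cstdint>

struct DebugLocRef { uint32_t FileIndex, Line, Column; };

// Hypothetical: what an inlined-code remark would need to answer both "what
// function is this code from?" and "where was it inlined to?".
struct InlinedRemarkLoc
{
  DebugLocRef Loc;           // where the remark points (callee source line)
  uint32_t CalleeNameIndex;  // name of the function the code came from
  DebugLocRef InlinedAt;     // caller-side location it was inlined into
};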

> I'd like to be able to either produce a report that shows just the inlining
> from the LTO pass or produce a report that shows a composite of all the
> inlining decisions that we made.

We took a different route with this one in response to a request from an
end-user--the extension can show different remark databases at the same time.
That way, you can compare LTO vs non-LTO builds line-by-line at a glance.
It isn't perfect, but generating a report about how inlining changed from two
disparate sets of remarks would be a logical next step.

--Will

(Hopefully this shows up in the right place; I was only subscribed to the digest.
Sorry if it doesn't thread properly!)