[RFC][llvm-mca] Adding binary support to llvm-mca.

Introduction

I would really like to add this feature to llvm-mca.

I have spoken off-line with Matt about this feature multiple times.
I am happy with the suggested approach. Matt prototyped it, and it seems to work okay for us.

However, it would be really nice to get feedback from somebody else (not necessarily people involved in the llvm-mca project).
For example, I am interested in what people think about the whole design (i.e. the idea of introducing two new intrinsics, and generating the information in a separate section of the binary object file).

About the suggested design:

I like the idea of being able to identify code regions using a numeric identifier.
However, what happens if a code region spans through multiple basic blocks?

My understanding is that code regions are not allowed to overlap. So, it makes sense if __mca_code_region_end() doesn’t take an ID as input.

However, what if __mca_code_region_end() ends in a different basic block?

__mca_code_region_start() has to always dominate __mca_code_region_end(). This is trivial to verify when both calls are in the same basic block; however, we need to make sure that the relationship is still the same when the end() call is in a different basic block.
That would not be enough. I think we should also verify that __mca_code_region_end() always post-dominates the call to __mca_code_region_start().
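
To make these two relationships concrete, here is a minimal sketch of a region that spans several basic blocks while still satisfying both properties. The C-level declarations are only an assumption for illustration; the thread only fixes the marker names, not their exact user-facing spelling:

```cpp
// Hypothetical user-facing declarations; the exact spelling is an assumption.
extern "C" void __mca_code_region_start(unsigned Id);
extern "C" void __mca_code_region_end(void);

int sum_positive(const int *Data, int N) {
  int Sum = 0;
  __mca_code_region_start(1); // entry block: dominates every block below
  for (int I = 0; I < N; ++I) // the region now spans the loop's basic blocks
    if (Data[I] > 0)
      Sum += Data[I];
  __mca_code_region_end();    // exit block: post-dominates the start call
  return Sum;
}
```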

My question is: what happens with basic block reordering? We don’t know the layout of basic blocks until we reach code emission. How does it work for regions that span through multiple basic blocks? I think your RFC should clarify this aspect.

As a side note: at the moment, llvm-mca doesn’t know how to deal with branches. So, for simplicity we could force code regions to only contain instructions from a single basic block.

However, in the future we may want to teach llvm-mca how to analyze branchy code too. For example, we could introduce a simple control-flow analysis in llvm-mca, and use external “branch trace” information (for example, a perf trace generated by an external tool) to decorate branches with branch probabilities (similarly to what we currently do in LLVM with PGO). We could then use that knowledge to model branch prediction and simulate what happens in the presence of multiple branches.

So, the idea of having regions that potentially span multiple basic blocks is not bad in general. However, I think you should clarify the constraints (at least, you should answer my questions from before).

If we decide to use those new intrinsics, then those should be experimental (at least to start).

Hi Andrea,

Thanks for your input.

[... snip ...]

> About the suggested design:
> I like the idea of being able to identify code regions using a numeric
> identifier.
> However, what happens if a code region spans through multiple basic blocks?

The current patch does not take into consideration cases where the
region start and end intrinsics are placed in different basic blocks.
Such would be the case if a region is defined to span multiple blocks.
This would be similar to the current case where a user places a
#LLVM-MCA-BEGIN assembly comment in one block and an #LLVM-MCA-END in
another. However, as you point out below, if the user does this in the
source code via intrinsics (just what this patch is proposing), then
there is a chance that optimizations might change the layout of the
instructions and confuse the ordering of the MCA intrinsics.

Since MCA does not follow branches (MCA just treats a branch as it would
a non-branching instruction), it seems that a user should be aware that
defining MCA code regions that span multiple blocks might result in an
unexpected analysis. While we do not discourage this, it seems like
such a case will probably not produce an expected result for the user.
We could introduce a warning, or automatically divide the regions so
that a single region can only contain a single block.

> My understanding is that code regions are not allowed to overlap. So, it
> makes sense if `__mca_code_region_end()` doesn't take an ID as input.
> However, what if `__mca_code_region_end()` ends in a different basic block?
>
> `__mca_code_region_start()` has to always dominate
> `__mca_code_region_end()`. This is trivial to verify when both calls are in
> the same basic block; however, we need to make sure that the relationship is
> still the same when the `end()` call is in a different basic block.
> That would not be enough. I think we should also verify that
> `__mca_code_region_end()` always post-dominates the call to
> `__mca_code_region_start()`.

In any case this patch should probably check dominance of the
intrinsics, even though MCA does not follow branches and MCA does not
explicitly forbid a region from containing multiple blocks.
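
For reference, a check along those lines could look roughly like the following. This is a sketch only (not the code in the patch), using the existing dominator and post-dominator analyses:

```cpp
#include "llvm/Analysis/PostDominators.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Instructions.h"

// Sketch of the two properties Andrea describes: the start call must dominate
// the end call, and the end call must post-dominate the start call. Start and
// End are the calls to the region start/end intrinsics.
static bool regionMarkersAreWellFormed(const llvm::CallInst *Start,
                                       const llvm::CallInst *End,
                                       const llvm::DominatorTree &DT,
                                       const llvm::PostDominatorTree &PDT) {
  return DT.dominates(Start, End) && PDT.dominates(End, Start);
}
```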

> My question is: what happens with basic block reordering? We don't know the
> layout of basic blocks until we reach code emission. How does it work for
> regions that span through multiple basic blocks? I think your RFC should
> clarify this aspect.
>
> As a side note: at the moment, llvm-mca doesn't know how to deal with
> branches. So, for simplicity we could force code regions to only contain
> instructions from a single basic block.
>
> However, in the future we may want to teach llvm-mca how to analyze branchy
> code too. For example, we could introduce a simple control-flow analysis in
> llvm-mca, and use external "branch trace" information (for example, a
> perf trace generated by an external tool) to decorate branches with
> branch probabilities (similarly to what we currently do in LLVM with PGO).
> We could then use that knowledge to model branch prediction and simulate
> what happens in the presence of multiple branches.
>
> So, the idea of having regions that potentially span multiple basic blocks
> is not bad in general. However, I think you should clarify the constraints
> (at least, you should answer my questions from before).

I agree! Thanks for pointing that out.

> If we decide to use those new intrinsics, then those should be experimental
> (at least to start).

Agreed.

-Matt

I want to clarify a few restrictions of llvm-mca code regions that this RFC proposes:

1) All llvm-mca code regions must start with an llvm.mca.code.region.start intrinsic and end with
an llvm.mca.code.region.end intrinsic. This rule is enforced at the IR level in the IR verifier.

2) llvm-mca code regions cannot nest. This restriction implies that an llvm.mca.code.region.start
must have an llvm.mca.code.region.end intrinsic without any other llvm.mca start intrinsics
between the two. The current implementation in the patch enforces this restriction at the
IR level via the IR Verifier.

3) An llvm-mca code region cannot span multiple basic blocks. llvm-mca does not follow
branches (yet). Instead, a branch instruction is treated by llvm-mca like any other instruction.
The current patch associated with this RFC does not enforce this restriction. I plan on updating
the patch to enforce that a code region can only belong to a single basic block. This is a simple
check, ensuring that both the llvm.mca.code.region.start and accompanying end intrinsics live
in the same basic block (a minimal sketch of such a check appears after this list). I imagine adding
this check at the IR level when we also verify points 1 and 2 above. That will keep the code-region
verification logic isolated to the IR verifier. The start/end
intrinsics should not have any uses, so I'm not sure that they would be moved/sunk on behalf
of any other instruction. In other words, I do not imagine that a start and end would be split
apart due to later MI optimizations. If I discover that such a case occurs, then I might add the
basic-block check prior to emitting the code region data to the object file. Once llvm-mca is
updated to handle branches, then we can remove this constraint.
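
The per-block walk I have in mind would be something like the sketch below. It is not the code from the patch, and the Intrinsic:: enumerators are placeholders for whatever names the patch finally registers for llvm.mca.code.region.{start,end}:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/IntrinsicInst.h"

// Sketch only: Intrinsic::mca_code_region_start/end are assumed names for the
// enumerators the patch would introduce for llvm.mca.code.region.{start,end}.
static bool verifyMCACodeRegions(const llvm::Function &F) {
  for (const llvm::BasicBlock &BB : F) {
    bool InRegion = false;
    for (const llvm::Instruction &I : BB) {
      const auto *II = llvm::dyn_cast<llvm::IntrinsicInst>(&I);
      if (!II)
        continue;
      if (II->getIntrinsicID() == llvm::Intrinsic::mca_code_region_start) {
        if (InRegion)
          return false; // rule 2: regions cannot nest
        InRegion = true;
      } else if (II->getIntrinsicID() == llvm::Intrinsic::mca_code_region_end) {
        if (!InRegion)
          return false; // rule 1: every end needs a matching start
        InRegion = false;
      }
    }
    if (InRegion)
      return false; // rule 3: start and end must live in the same basic block
  }
  return true;
}
```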

-Matt

Thanks for clarifying it Matt.

In general, I quite like your suggested design.

My only concern is about the semantics of the two new intrinsics. Your design doesn’t allow mca ranges to span multiple basic blocks. That constraint is acceptable for now, since llvm-mca doesn’t know how to deal with control flow.

However, I am a bit concerned about what might happen in the future if we decide to let users specify code regions that span multiple basic blocks. Basically, I don’t particularly like the idea of changing the semantics of already existing intrinsics. A design that already accounts for that particular scenario/future work would be ideal. That being said, marking those new intrinsics as ‘experimental’ may be a good compromise (at least for now).

So, I am quite happy overall with the direction of this RFC.

However, I am interested to hear from other developers about your suggested design.

> This initial patch only targets ELF object files, and does not handle
> relocatable addresses. Since the start of a code region is represented as an
> assembly label, and referenced in the .mca_code_regions section, that address
> is relocatable.

This may be okay for now. However, it would be nice to remove that constraint in the future and add support for generic object files.
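
For readers who have not looked at the patch, one plausible shape for what .mca_code_regions stores per region is sketched below. The real encoding is whatever D54603 emits; this only illustrates the fields this thread mentions (start label address, region size, user ID):

```cpp
#include <cstdint>

// Illustrative only: one possible per-region record for .mca_code_regions.
// The actual layout is defined by the patch, not by this sketch.
struct MCACodeRegionRecord {
  uint64_t StartAddress; // address of the .Lmca_code_region_start_<n> label
  uint64_t SizeInBytes;  // distance from the start label to the end label
  uint32_t UserID;       // numeric identifier passed to the start intrinsic
};
```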

-Andrea

So, I have been thinking a bit more about this whole design.

The more I think about your suggested design, the more I am convinced that we should do something more to support ranges in binary object files too.
My understanding is that the reason why we don’t support object files in general is the presence of relocations. That is because a region start marker is effectively symbol relative, and the symbol (a function) would be relocated in the final executable.
You mentioned to me that resolving even a ‘simple’ symbol-relative relocation is not trivial, because it requires specific knowledge about the binary format and the target (i.e. how relocations are encoded is target specific). I am surprised that there is not a utility library for resolving relocations… but I am not familiar with that part of the compiler. I was hoping that there was a target-specific interface to use in this case…

An alternative approach would require that you define your own “symbol-relative” reference. After all, ranges are just sequences of instructions in a function. If a function symbol is described by the symbol table, then you should be able to obtain its offset in the .text section. So, you could potentially encode your own symbol+offset. However, the linker would not be able to understand your “custom relocation”, and information about regions in the final ELF would be basically broken. So, that would not be a solution…

Honestly, I don’t know what the best approach to use in this case is.
As a compromise, it would not be a bad idea to add the ability to specify ranges from the command line. What do you think?
Still, from a user point of view, the idea that we don’t support object files in general sounds like a big limitation.

About the new experimental intrinsics: those would definitely work well for the simple case where instructions are from the same basic block.
However, some/most of the constraints that you plan to add will have to change if in the future we decide to allow ranges that potentially cross multiple basic blocks. How will the rules/constraints on those new intrinsics change? I just want to make sure that the suggested design is future-proof.

-Andrea

Hi Andrea,

> So, I have been thinking a bit more about this whole design.
>
> The more I think about your suggested design, the more I am convinced that
> we should do something more to support ranges in binary object files too.
> My understanding is that the reason why we don't support object files in
> general is the presence of relocations. That is because a
> region start marker is effectively symbol relative, and the symbol (a
> function) would be relocated in the final executable.
> You mentioned to me that resolving even a 'simple' symbol-relative
> relocation is not trivial, because it requires specific knowledge about the
> binary format and the target (i.e. how relocations are encoded is target
> specific). I am surprised that there is not a utility library for resolving
> relocations... but I am not familiar with that part of the compiler. I was
> hoping that there was a target-specific interface to use in this case...

There might be a better way of resolving the relocs, but from what I saw
looking at llvm-objdump and other related tools, it seems that resolving
the relocated symbol is a target specific effort. I also spent some time
sniffing around ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp which also
performs the reloc resolution. I should clarify that I too am not an
expert in llvm's utilities for performing symbol/reloc resolution, and
perhaps someone in the community can point me in the right direction. I
can clearly see the reloc data in the object file via tools like
objdump; however, accessing the relocs via
llvm::object::ObjectFile::relocations() did not produce address values
that we could use (values of zero).
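
For anyone who wants to reproduce what I tried, the generic walk looks roughly like this. It is a sketch, not the patch's code; it only enumerates relocations and does not resolve them, since resolving is where the target-specific knowledge comes in:

```cpp
#include "llvm/Object/ObjectFile.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;
using namespace llvm::object;

// Sketch: walk every relocation in every section and print its offset and
// target symbol name. Applying the relocation (addend, type-specific
// calculation) is the target-specific part that this generic API does not do.
static void dumpRelocations(const ObjectFile &Obj) {
  for (const SectionRef &Sec : Obj.sections()) {
    for (const RelocationRef &Rel : Sec.relocations()) {
      StringRef SymName = "<no symbol>";
      symbol_iterator Sym = Rel.getSymbol();
      if (Sym != Obj.symbols().end()) {
        if (Expected<StringRef> NameOrErr = Sym->getName())
          SymName = *NameOrErr;
        else
          consumeError(NameOrErr.takeError());
      }
      errs() << "reloc at offset " << Rel.getOffset() << " against " << SymName
             << "\n";
    }
  }
}
```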

I was hoping that, for a first pass at this patch, supporting just
executables would be okay. That keeps this initial patch set simple,
and hopefully will encourage others to take a peek at it, since it's
less daunting than what it might otherwise be. Of course, there is the
concern that this initial patch will lock us into a design that will be
more complicated to unravel later.

> An alternative approach would require that you define your own
> "symbol-relative" reference. After all, ranges are just sequences of
> instructions in a function. If a function symbol is described by the symbol
> table, then you should be able to obtain its offset in the .text section.
> So, you could potentially encode your own symbol+offset. However, the
> linker would not be able to understand your "custom relocation", and
> information about regions in the final ELF would be basically broken.
> So, that would not be a solution...
>
> Honestly, I don't know what the best approach to use in this case is.
> As a compromise, it would not be a bad idea to add the ability to specify
> ranges from the command line. What do you think?
> Still, from a user point of view, the idea that we don't support object
> files in general sounds like a big limitation.

I agree, only supporting executables is a limitation. However, I'd
like to land the base support now and add in the additional
features/support after this large patch set lands. But I can see
where landing the whole thing entirely also makes sense.

> About the new experimental intrinsics: those would definitely work well for
> the simple case where instructions are from the same basic block.
> However, some/most of the constraints that you plan to add will have to
> change if in the future we decide to allow ranges that potentially cross
> multiple basic blocks. How will the rules/constraints on those new
> intrinsics change? I just want to make sure that the suggested design is
> future-proof.

Since the llvm/clang parts of the code are just responsible for
collecting where a range starts/ends, I hope that we can remove some
of the baked-in constraints that are specified in IR/Verifier.cpp.
As you pointed out earlier in this thread, we might want to
introduce a dominance check if/when we lift the one-basic-block
restriction.

-Matt

Hi Matt/Andrea,

I see pros and cons for IACA-style markers vs intrinsics.
On the one hand, IACA-style markers are very magical, and not very visible in either the source or the object code. Using IACA-style markers has the advantage that you can use llvm-mca as a drop-in replacement for IACA, or even compare their outputs on the exact same binary. They also do not require tooling on the compiler side and allow comparing the output of several compilers.
On the other hand, IACA-style markers do not have an equivalent on other architectures, and I’m not sure inventing new ones is a good idea :) I think the latter makes them pretty much a no-go for llvm-mca, as I don’t think we’ll want to teach each target how to parse code regions. That’s much better handled in a target-agnostic way by the object. Intel got away with them because they only had to support one architecture.

tl;dr: In the case of llvm-mca, I like your design better than the markers.

In terms of future-proofness of only allowing regions within a basic block, are we confident we can actually ever simulate branches apart from an "always taken, perfectly predicted" loop? Even this simple need requires knowing quite a few details on the frontend. The current design could handle this use case with the addition of an external "loop mode" option to MCA. If there are no other strong use cases, I would advocate for experimental intrinsics unless people can contribute other example use cases.

+1 to what Clement said.
I believe the intrinsics are a better design to support many architectures.

IACA users are probably decorating their code with IACA_START / IACA_END macros. One possibility is to provide a header that defines these macros in terms of the new intrinsics.
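
Something along these lines, for example. The user-level spellings of the intrinsics below are assumptions; only the IACA_START/IACA_END macro names match IACA's:

```cpp
// Hypothetical compatibility header; the intrinsic declarations are assumed.
#ifndef MCA_IACA_COMPAT_H
#define MCA_IACA_COMPAT_H

extern "C" void __mca_code_region_start(unsigned Id);
extern "C" void __mca_code_region_end(void);

// A fixed ID is used here purely for illustration.
#define IACA_START __mca_code_region_start(0);
#define IACA_END   __mca_code_region_end();

#endif // MCA_IACA_COMPAT_H
```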

Thanks for the feedback Guillaume and Clement!

In response to Clement:

> In terms of future-proofness of only allowing regions within a basic
> block, are we confident we can actually ever simulate branches apart from an
> "always taken, perfectly predicted" loop? Even this simple need requires
> knowing quite a few details on the frontend. The current design could
> handle this use case with the addition of an external "loop mode" option to
> MCA. If there are no other strong use cases, I would advocate for
> experimental intrinsics unless people can contribute other example use
> cases.

In short, I am in agreement and think that handling of branching or loop
constructs should be isolated to the llvm-mca driver/front-end. The
only thing the code regions should be concerned with is identifying
blocks of instructions that will later be used by the front end.

We can place limitations on how those blocks are formed. For example, the
current implementation forces regions to be isolated to a single basic
block. However, we anticipate lifting this restriction once branching
is handled.

-Matt

Hi Matt,

I can see a near future where perf-analysis tooling uses branch history profiler captures to determine how often loops/branches are taken and feeds that into llvm-mca, especially for hot/branchy loop analysis reports etc. Are you confident that your approach will be easily extendable for this?

Similarly, being able to generally embed the profile markers in object libraries for reuse is going to be important for some people - I'd like to see more of a plan of how this will be achieved. I understand that it might not be easy for some exe formats.

Sorry if I'm being too critical, but I'm a bit worried that we end up with an initial implementation that will take a lot of reworking to meet our final aims.

Thanks, Simon.

Thanks for the response Simon. My reply is inline:

From: Simon Pilgrim <llvm-dev@redking.me.uk>
Sent: Monday, December 10, 2018 1:40 PM

Hi Matt,

> I can see a near future where perf-analysis tooling uses branch history
> profiler captures to determine how often loops/branches are taken and
> feeds that into llvm-mca, especially for hot/branchy loop analysis
> reports etc. Are you confident that your approach will be easily
> extendable for this?

That is a very interesting use case. The restriction of a code-region to a single block
is a limitation for any tools that want to analyze branches. However, I believe
that it will be easy to lift this restriction (it's just a check in IR/Verifier). This limitation is not
expressed in the llvm-mca driver.

If the information is coming from a profile report, then we'd most
likely need to extend the llvm-mca driver to accept profile reports. Currently,
code regions, from the perspective of the llvm-mca driver, are very simple. They are
just a collection of MCInst. The binary support in this RFC+patch
disassembles just the address range from the marker start address for
some specified number of bytes. It might be useful to add another driver argument
so that a user (or tool) can specify, from the command line, a range of instructions to
analyze. I recently added a class for handling inputs to llvm::mca::CodeRegionGenerator,
which is just responsible for taking some input and creating a list of MCInst that llvm-mca uses.
We could subclass this to handle profile reports.

> Similarly, being able to generally embed the profile markers in object
> libraries for reuse is going to be important for some people - I'd like
> to see more of a plan of how this will be achieved. I understand that it
> might not be easy for some exe formats.

That is definitely a limitation. This initial patch+RFC only handles linked
executables (i.e., the llvm-mca marker symbol addresses are resolved).
I'm working on a better solution so that this will not be a restriction.
In fact, I'll probably delay trying to land any patches until I solve relocations
(or use a different solution for identifying start/end addresses for llvm-mca code regions).

> Sorry if I'm being too critical, but I'm a bit worried that we end up
> with an initial implementation that will take a lot of reworking to meet
> our final aims.
>
> Thanks, Simon.

I understand your criticisms and value your input. Thanks a ton!

-Matt

Just an update to this RFC. (I thought this was going to be a short email... my apologies).

One of the primary limitations described in this RFC (earlier in this thread) is that the patch for this RFC only handles linked executables. This restriction comes from my wanting to avoid handling relocations in a target-specific manner, at least in the initial patch, especially since I want to keep the initial patch simple. However, sometimes simple and practical are at odds with each other. I envision the main use case of llvm-mca with binary support being the analysis of .o files, probably more often than fully linked executables. Without handling relocated objects, this patch seems rather useless.

I started exploring an alternative solution this weekend to the aforementioned problem. This alternative solution avoids having to handle relocations, but does give us support for object files (with relocated symbols) and linked executables. The change is quite simple, and seems to be effective. In short, we still generate intrinsics as discussed in the RFC, one to mark the start of a code region, and another to mark the end. These intrinsics get lowered into local symbols. The symbols are already encoded with address information about their position in the object file. What is different is that we ensure that these symbols have unique names and also encode the user provided ID value.
  
Previously the labels were named like .Lmca_code_region_start_<number>, and similarly for .Lmca_code_region_end. The user ID number and region size were encoded in the .mca_code_regions object file section. Previously, MCA never looked at the symbol table. But in reality we can calculate the region size by using the symbols in the symbol table (look for the MCA symbols), instead of relying on the information encoded in .mca_code_regions. The alternative approach gets rid of that section entirely but achieves the same functionality by encoding the information in the symbol name. In short, the alternative approach just parses the symbol table for MCA symbols, and the symbol names are encoded with the data we need.

The newly proposed name is formatted as: .Lmca_code_region_start.<id>.<function>.<number>. Similar for mca_code_region_end. 'function' is the function that the marker appears in, 'ID' is the user-specified ID (this is a value that users specify for easily identifying the code region under analysis... just cosmetic), and 'number' is a unique number to avoid any duplicate name conflicts. The benefit of this alternative solution is that we can get rid of .mca_code_regions, and gather all of the information llvm-mca needs by parsing the symbol table looking for any symbols with the 'mca_code_region_start' and 'mca_code_region_end' format discussed above. Of course, if the string table is stripped, then we will lose this data. The main drawback from this alternative approach is that it relies on encoding symbol names and string processing on those names. I'm somewhat biased against doing string parsing, but the code to perform this is simple and small, and more importantly it allows llvm-mca to handle linked or relocated object files.
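
The string processing involved is small; roughly something like the following. This is a sketch, not the code from the patch, and it assumes the name layout described above survives into the symbol table:

```cpp
#include "llvm/ADT/StringRef.h"
#include <cstdint>
#include <tuple>

// Sketch of parsing a symbol named
//   .Lmca_code_region_{start,end}.<id>.<function>.<number>
// into its parts. It assumes '.' does not appear inside <id> or <number> and
// takes everything in between as the function name.
struct MCARegionSymbol {
  bool IsStart = false;
  uint64_t ID = 0;
  llvm::StringRef Function;
  uint64_t Number = 0;
};

static bool parseMCARegionSymbol(llvm::StringRef Name, MCARegionSymbol &Out) {
  Name.consume_front(".L"); // the local-label prefix may already be stripped
  if (Name.consume_front("mca_code_region_start."))
    Out.IsStart = true;
  else if (Name.consume_front("mca_code_region_end."))
    Out.IsStart = false;
  else
    return false;

  llvm::StringRef IDStr, Rest, NumStr;
  std::tie(IDStr, Rest) = Name.split('.');
  std::tie(Out.Function, NumStr) = Rest.rsplit('.');
  // getAsInteger returns true on failure.
  return !IDStr.getAsInteger(10, Out.ID) && !NumStr.getAsInteger(10, Out.Number);
}
```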

-Matt

Adding the llvm-dev list, because my email client decides to remove certain lists when I reply-all... including the list that I intend to respond to.

Hi Matt,

could you please update the associated patch too?

Thanks
-Andrea

Hi Andrea,

I’ve updated the patch here:

https://reviews.llvm.org/D54603/new/

I have not modified the description in the Phabricator. Of course, if this solution is the way-forward, then I will be happy to update the change description.

I also modified the format of the symbol names (from what I previously described in my last response below) to save room: “.mca_code_region.<id>.<number>”. Here “id” is the user-specified identifier they want associated with the region (cosmetic), and “number” is a module-unique number for the region. I have tested this with non-executable objects, executable files, and multiple objects linking into a single executable.

Also, I am trimming this thread up (sorry, but it’s starting to get unwieldy and I think the mailing list bot is grumpy).

-Matt

I wanted to write a response to describe the current state of this RFC and accompanying patch. I’ll try not to make this a wall of text.

A bit deeper in the thread we discussed a few limitations of the patch:

  1. The patch only worked on linked objects (relocations are resolved).

  2. The patch limited code regions to a single basic block.

a. This was a decision made because llvm-mca does not ‘follow’ branch instructions.

The most recent version of this patch (December 17th, with minor cleanup on the 23rd) addresses #1. This means that code regions can be defined in an object file (unlinked) and can be analyzed by llvm-mca. The solution implemented relies solely on the symbol table. llvm-mca can locate each code region based on address values specified by the symbol table, but it means that we have to name the start and end symbols in a way such that llvm-mca can locate them in the symbol table. The format is discussed earlier in this thread; it relies on llvm-mca scanning the symbol table for symbols matching the “.mca_code_region_{start,end}.” naming format.

#2 is trickier to solve, in the sense that there’s no right way. I removed the error checker (IR/Verifier.cpp) that rejected programs if a code region spans multiple blocks. In short, code regions are just labels (start, end), and llvm does not verify their depth (whether they span multiple blocks). Instead, llvm now just checks that regions are terminated and that regions are not nested (which the patch has always done). This means that llvm-mca will handle regions that span multiple blocks, just as it currently does when a user manually instruments the assembly input with #LLVM-MCA-BEGIN comments that encapsulate instructions spanning multiple blocks. I hope that this is a reasonable solution?

In short, I think we have addressed #1, and possibly #2. By lifting the restriction in #2 (removing the IR/Verifier check), we now put the responsibility for rejecting regions on llvm-mca (in the case of regions that span multiple basic blocks). In other words, when it comes to multiple blocks, this patch maintains the same semantics as the LLVM-MCA-BEGIN comments a user can add to their assembly. When branch handling is implemented in llvm-mca, we would want to lift this restriction in the compiler anyway.

https://reviews.llvm.org/D54603/new/

-Matt

Forgot to cc the dev list (my email client seems to drop certain lists when reply-all is selected).