[RFC] Abstracting over SSA form IRs to implement generic analyses

Hi LLVM community,

Earlier this year I first proposed a new way of writing analyses that can be applied to multiple IRs, for example, applying the same analysis to both LLVM IR and MachineIR.

LLVM already has some analyses like that; for example, dominator tree construction and loop info. However, they’re all limited to looking at the control flow graph: basic blocks and lists of predecessors and successors. We want to push the envelope with a divergence analysis that is also aware of instructions and values, and we ran into severe limitations in what we could do with the techniques that are commonly used in LLVM today. Some of those limitations are about which concepts are exposed generically at all, though the bulk of them revolve around the readability and maintainability of the resulting generic code.

After more evolution of the ideas and many discussions over the last few months, I want to raise this proposal once more – this time with an extensive document: https://docs.google.com/document/d/1sbeGw5uNGFV0ZPVk6h8Q5_dRhk4qFnKHa-uZ-O3c4UY/edit?usp=sharing

Feel free to comment on the document, though high-level discussion is probably best kept in this email thread.

The concrete proposal is to enable 4 tools for use by generic analyses:

  • type erasure
  • an SsaContext context class concept with a fairly small surface area
  • dynamic polymorphism via per-analysis adapters
  • dynamic polymorphism via an LLVM-wide adapter of SsaContext

The document goes to some length to explain what precisely is meant by each of those bullets, including code examples, as well as describing a few other options that we don’t propose, based on their relative merits.
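To give a flavor of the SsaContext bullet, here is a minimal sketch of what such a context class could look like when instantiated for LLVM IR. All names and signatures below are illustrative assumptions for this thread, not the interface specified in the document:

    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/CFG.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/Support/Casting.h"

    // Hypothetical LLVM IR instance of an SsaContext-like concept. A generic
    // analysis would be written purely against the nested types and methods.
    class LlvmSsaContext {
    public:
      using BlockRef = llvm::BasicBlock *;
      using InstructionRef = llvm::Instruction *;
      using ValueRef = llvm::Value *;

      // CFG navigation, forwarded to the existing free functions from CFG.h.
      auto successors(BlockRef B) const { return llvm::successors(B); }
      auto predecessors(BlockRef B) const { return llvm::predecessors(B); }

      // SSA navigation: the instruction defining a value, or null for
      // arguments and constants.
      InstructionRef defInstruction(ValueRef V) const {
        return llvm::dyn_cast<llvm::Instruction>(V);
      }
    };

A MachineIR instance would expose the same surface over MachineBasicBlock and friends, which is what lets a generic analysis compile against either.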

There are concrete patches that go along with the proposal and that you can refer to for additional context. In logical sequence, they are:

I would like us to get to general agreement on this thread that this is a direction we want to go in and that we can proceed with the proposed code changes.

Thanks,
Nicolai

Hi Nicolai,

I’m quite sympathetic to the idea of abstracting away the details of IRs and having a single implementation of analyses and optimizations over these abstractions. IMHO, that’s the main feature missing in MLIR.
That said, the analyses you consider in the document seem a bit limited. The first two, at least, related to dominators, are just a few lines of code. Which makes me wonder whether the trouble, extra complexity, etc., of going this generic way is worth it. Isn’t it simpler to just duplicate them in MLIR?

I would be very supportive of a generic alias analysis, for example. That’s a serious amount of work, with potentially complicated algorithms that could be reused across LLVM and various MLIR dialects. Plus, LLVM could use a new alias analysis algorithm. But that requires going a bit further than your proposal. One needs an abstraction that can cover several IRs and provide enough information to the AA algorithm in an efficient way. It also likely needs to know about undefined behavior. But MLIR has been avoiding that issue for now, AFAICT. So anything on that front would likely be wrong, as there hasn’t been much thought about UB in MLIR. (AFAIK; I don’t follow MLIR closely.)
MLIR’s ODS doesn’t seem sufficient either, as it only allows (at least for now) attaching shallow semantic attributes to instructions, similar to LLVM’s. Maybe that could be extended somehow?

TL;DR: While I’d love to see LLVM/MLIR go in the direction you are proposing, I’m not sure the set of considered analyses is sufficiently complicated to warrant the investment in complicated infrastructure.

Nuno

Hi Nuno,

> I’m quite sympathetic to the idea of abstracting away the details of IRs and having a single implementation of analyses and optimizations over these abstractions. IMHO, that’s the main feature missing in MLIR.
> That said, the analyses you consider in the document seem a bit limited. The first two, at least, related to dominators, are just a few lines of code. Which makes me wonder whether the trouble, extra complexity, etc., of going this generic way is worth it. Isn’t it simpler to just duplicate them in MLIR?

The main analysis that motivates this work is divergence analysis, which is a lot more than just a few lines of code. It’s mentioned in the document but not pasted in there in full precisely because it’s so large and has complexities that are very much orthogonal to the proposal itself :)

Those complexities also mean that subtle bugs have occurred in the past and we have to assume that we might find more in the future, so duplication of code seems unwise to me: keeping all copies in sync manually wouldn’t be fun.

It would also be a triplication of effort, actually, not just duplication, since we need to support MachineIR in order to make GlobalISel work properly for the AMDGPU target.

I hope this clarifies our motivation and why we think this is a good trade-off :)

Cheers,
Nicolai

Thanks for sending this out!

+Mehdi - if you have a chance to look, I’d appreciate your thoughts on this, partly as a general LLVM contributor, but also with an eye to abstractions over multiple IRs given your work on MLIR
+Lang - you mentioned maybe some colleagues of yours might be able to take a look at the design here?

I’d like to try writing up alternatives to a couple of the patches in the series to see how they compare, but haven’t done that as yet.

My brief take is that I think the opaque handles make a fair bit of sense/seem high-value/low-cost (if it turned out the handle for one of these IR types needed to become a bit heavier, I don’t think it’d be super difficult to revisit this code - make the generic handle large enough to accommodate whatever the largest concrete handle happens to be, etc). Though I’m still a bit less certain about the full runtime polymorphic abstractions.

Owen - perhaps you've got some relevant perspective here given the GPU
context/background, and general middle end optimization design
questions. Could you take a look?
(& perhaps Johannes?)

For context, this is about the options laid out in the following document from Nicolai:

LLVM Proposal: Abstracting over SSA form IRs (LLVM IR, SSA-MIR, MLIR) for generic analyses - Google Docs

The above document was first introduced in the following email, which is the starting point of the current thread:

https://lists.llvm.org/pipermail/llvm-dev/2020-December/147433.html

The main issue is the need for an abstraction that greatly improves the experience of writing analyses that work on LLVM IR and MIR (and potentially also on MLIR). The way I understand it, Nicolai proposed an abstraction that involves dynamic polymorphism to abstract out the details of the underlying IR. David was more in favour of first unifying the IRs to the point that the dynamic polymorphism is limited to only occasional corner cases instead.

I am currently involved in the activity towards decoupling AMDGPU's dependency on the abstractions being discussed here, and the one thing that immediately jumps out is the ability to traverse the predecessors and successors of a basic block. From all the emails so far and the resulting proposal, it seems that both David and Nicolai think that having a common non-abstract base class for basic blocks is a good idea. Such a base class will provide a vector of predecessors and successors that can be used to traverse the CFG independent of the actual IR.

The current situation is that the basic blocks in LLVM IR and MLIR do not explicitly track their preds and succs. Preds are determined from the uses of the block, while succs are determined from the terminator instruction. MIR has explicit pred and succ vectors, but the succ pointers are duplicated by their presence as operands to the terminator instructions. The MachineBasicBlock is not a value, so there is no notion of traversing its uses for predecessors.
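To make the asymmetry concrete, here is what traversal looks like with the existing interfaces on both sides (these are real APIs in today's tree, shown side by side):

    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/CFG.h"

    void traverse(llvm::BasicBlock *BB, llvm::MachineBasicBlock *MBB) {
      // LLVM IR: free functions; successors are derived from the terminator,
      // predecessors from the block's uses.
      for (llvm::BasicBlock *Succ : llvm::successors(BB)) { (void)Succ; }
      for (llvm::BasicBlock *Pred : llvm::predecessors(BB)) { (void)Pred; }

      // MIR: member functions backed by explicitly stored vectors.
      for (llvm::MachineBasicBlock *Succ : MBB->successors()) { (void)Succ; }
      for (llvm::MachineBasicBlock *Pred : MBB->predecessors()) { (void)Pred; }
    }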

So if we create a common non-abstract base class for basic blocks, then MLIR and LLVM IR will incur the overhead of two vectors, one each for the preds and succs. Nicolai had reported a 3.1% to 4.1% increase in the size of LLVM IR in some typical programs:

LLVM: Overhead of explicit predecessor / successor lists - Google Sheets

Is this overhead the main issue in deciding whether we should introduce the common base class for traversing a CFG? Do note that MIR already duplicates successors. One could argue that duplicating the successor pointers in LLVM IR and MLIR merely brings them to the same level as MIR. If that argument sticks, then the "real" overhead is only half of what was reported, namely the part that accounts for the duplicated predecessor pointers.

The duplication of predecessor pointers stems from the fact that basic blocks in LLVM IR and MLIR are also values that are (primarily) used as operands to the terminator and PHI instructions. We could redefine these instructions to use non-value basic blocks as operands. CFG edges are not the typical use-def relation that other values represent, and it's reasonable to say that PHINodes and terminators are special in this one way. I have not managed to see how far that rabbit hole goes, but I am not really convinced that this is attractive in any way. Is this even an option?

Sameer.

I’m not knowledgeable about everything you’re looking at, but one minor thing: in MLIR, blocks aren’t regular value operands and aren’t part of the value hierarchy; they don’t participate in the usual use-def chains (and MLIR does not have PHI nodes either).
Note that LLVM takes advantage of blocks being values to implement the blockaddress constant, which would be difficult in MLIR at the moment.

Best,

(FWIW, I'd still really love Mehdi, Owen, or anyone else's opinion on
the original proposal or any parts of this general problem - folks
working on some profile features have been continuing the template
direction to abstract over MIR+LLVM IR and if there's benefits to be
had here, it'd be great to get them... )

> Owen - perhaps you've got some relevant perspective here given the GPU
> context/background, and general middle end optimization design
> questions. Could you take a look?
> (& perhaps Johannes?)
>
> >
> > Thanks for sending this out!
> >
> > +Mehdi - if you have a chance to look, I'd appreciate your thoughts on this, partly as a general LLVM contributor, but also with an eye to abstractions over multiple IRs given your work on MLIR
> > +Lang - you mentioned maybe some colleagues of yours might be able to take a look at the design here?
> >
> > I'd like to try writing up maybe alternatives to a couple of the patches in the series to see how they compare, but haven't done that as yet.
> >
> > My brief take is that I think the opaque handles make a fair bit of sense/seem high-value/low-cost (if it turned out the handle for one of these IR types needed to become a bit heavier, I don't think it'd be super difficult to revisit this code - make the generic handle large enough to accommodate whatever the largest concrete handle happens to be, etc). Though I'm still a bit less certain about the full runtime polymorphic abstractions.

> For context, this is about the options laid out in the following document from Nicolai:
>
> LLVM Proposal: Abstracting over SSA form IRs (LLVM IR, SSA-MIR, MLIR) for generic analyses - Google Docs
>
> The above document was first introduced in the following email, which is the starting point of the current thread:
>
> [llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses
>
> The main issue is the need for an abstraction that greatly improves the experience of writing analyses that work on LLVM IR and MIR (and potentially also on MLIR). The way I understand it, Nicolai proposed an abstraction that involves dynamic polymorphism to abstract out the details of the underlying IR. David was more in favour of first unifying the IRs to the point that the dynamic polymorphism is limited to only occasional corner cases instead.

To clarify, I think what I was leaning towards was having the APIs
match syntactically, so it'd be easier to write templates over them
without the need for type traits and adapters.

> I am currently involved in the activity towards decoupling AMDGPU's dependency on the abstractions being discussed here, and the one thing that immediately jumps out is the ability to traverse the predecessors and successors of a basic block. From all the emails so far and the resulting proposal, it seems that both David and Nicolai think that having a common non-abstract base class for basic blocks is a good idea.

I don't /think/ that's what I was advocating for (but it's been a
while, and I might be misremembering - if you've got pointers to
particular parts of the conversation to refresh my memory/frame this
discussion, that'd be handy) - I'd worry about the cost of trying to
introduce a common base class between these different APIs. I think
what I was advocating for was for them to have syntactically matching
APIs so that templated code could abstract over them without the need
for traits/adapters.

> Such a base class will provide a vector of predecessors and successors that can be used to traverse the CFG independent of the actual IR.
>
> The current situation is that the basic blocks in LLVM IR and MLIR do not explicitly track their preds and succs. Preds are determined from the uses of the block, while succs are determined from the terminator instruction. MIR has explicit pred and succ vectors, but the succ pointers are duplicated by their presence as operands to the terminator instructions. The MachineBasicBlock is not a value, so there is no notion of traversing its uses for predecessors.
>
> So if we create a common non-abstract base class for basic blocks, then MLIR and LLVM IR will incur the overhead of two vectors, one each for the preds and succs. Nicolai had reported a 3.1% to 4.1% increase in the size of LLVM IR in some typical programs:
>
> LLVM: Overhead of explicit predecessor / successor lists - Google Sheets

That's overhead in the bitcode file size, yes?

> Is this overhead the main issue in deciding whether we should introduce the common base class for traversing a CFG?

I expect that would be a pretty significant concern, but also memory
usage and any time overhead keeping these new data structures up to
date, or slowing down compilation due to memory paging in and out,
etc.

David Blaikie writes:

> > I am currently involved in the activity towards decoupling AMDGPU's
> > dependency on the abstractions being discussed here, and the one
> > thing that immediately jumps out is the ability to traverse the
> > predecessors and successors of a basic block. From all the emails so
> > far and the resulting proposal, it seems that both David and Nicolai
> > think that having a common non-abstract base class for basic blocks
> > is a good idea.
>
> I don't /think/ that's what I was advocating for (but it's been a
> while, and I might be misremembering - if you've got pointers to
> particular parts of the conversation to refresh my memory/frame this
> discussion, that'd be handy) - I'd worry about the cost of trying to
> introduce a common base class between these different APIs. I think
> what I was advocating for was for them to have syntactically matching
> APIs so that templated code could abstract over them without the need
> for traits/adapters.

You're right. Turns out that you didn't specifically advocate what I
said above. The following is what was actually said, but I misunderstood
something else from the thread:

https://lists.llvm.org/pipermail/llvm-dev/2020-October/146090.html

  "I meant for a different direction - if we got buy-in for making LLVM IR and
   MIR implement the same C++ concept (& wrote down what that concept was -
   which APIs, how they look textually) then I think it would be relatively
   easy to get the small/mechanical renaming patches reviewed. Incremental
   from day 1, not big new abstraction, then incremental."

> > So if we create a common non-abstract base class for basic blocks,
> > then MLIR and LLVM IR will incur the overhead of two vectors, one
> > each for the preds and succs. Nicolai had reported a 3.1% to 4.1%
> > increase in the size of LLVM IR in some typical programs:
> >
> > LLVM: Overhead of explicit predecessor / successor lists - Google Sheets
>
> That's overhead in the bitcode file size, yes?

It's the memory overhead of storing those pointers in a vector. I don't
think it needs to affect the bitcode, since it's just redundant
information that can be populated after reading the whole function.

> > Is this overhead the main issue in deciding whether we should
> > introduce the common base class for traversing a CFG?
>
> I expect that would be a pretty significant concern, but also memory
> usage and any time overhead keeping these new data structures up to
> date, or slowing down compilation due to memory paging in and out,
> etc.

Right. If the discussion progresses to those details, then we will be in
a good place :)

Everyone seems to agree that we need an abstraction that makes it easier
to write analyses, and Nicolai's document captures all the options. But
in particular the document makes a case against templates, and instead
advocates a way to write analyses without referring to the actual types
in the underlying IR. This is achieved with a combination of opaque
handles and dynamic polymorphism. The document also strongly recommends
minimizing the dynamic polymorphism by designing "common non-abstract
base classes for SSA basic block and instructions concepts".

My focus right now is on the base class for basic blocks. It represents
a frequently used property of the CFG, and the problem is concise and
already familiar to anyone writing an analysis. Our current experience
with writing new analyses in AMDGPU provides a motivation for moving
away from templates. But this is only a specific part of the bigger
problem. In particular, it only represents the "CFG" part of the
problem, but not the "SSA IR" part.

The actual use-case can be seen in the branch below, which implements
"CycleInfo" as a template over LLVM IR and MIR. It's a work in progress.

https://github.com/ssahasra/llvm-project/tree/cycleinfo-as-template

For now, we can consider a simpler analysis "CfgInfo" that reports
various properties such as outgoing edges per node.

Example A: CfgInfo over LLVM IR

    void CfgInfo::compute(BasicBlock *entry) {
      worklist.enqueue(entry);
      while (!worklist.empty()) {
        auto B = worklist.dequeue();
        for (auto S : successors(B)) {
          worklist.enqueue_if_new(S);
          collectStats(B, S);
        }
      }
    }

Example B: CfgInfo over Machine IR

    void CfgInfo::compute(MachineBasicBlock *entry) {
      worklist.enqueue(entry);
      while (!worklist.empty()) {
        auto B = worklist.dequeue();
        for (auto S : B->successors()) { // <-- diff
          worklist.enqueue_if_new(S);
          collectStats(B, S);
        }
      }
    }

Example C: Template over both IRs

    template <typename BlockType>
    void CfgInfo::compute(BlockType *entry) {
      worklist.enqueue(entry);
      while (!worklist.empty()) {
        auto B = worklist.dequeue();
        for (auto S : successors(B)) { // <-- needs an overload
          worklist.enqueue_if_new(S);
          collectStats(B, S);
        }
      }
    }

In this example, every type used as BlockType needs to provide a global
function "successors()". The following is sufficient for Machine IR in
this particular case, but in general, this would be provided by traits:

    namespace llvm {
      auto successors(MachineBasicBlock *B) { return B->successors(); }
    }

Example D: CfgInfo over a common base class

    void CfgInfo::compute(BasicBlockBase *entry) {
      worklist.enqueue(entry);
      while (!worklist.empty()) {
        auto B = worklist.dequeue();
        for (auto S : B->successors()) { // <-- not virtual
          worklist.enqueue_if_new(S);
          collectStats(B, S);
        }
      }
    }

This last version is not a template. It's an implementation that gets
reused for both LLVM IR and Machine IR by simply casting the respective
basic block type to the base type.
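Usage would then look something like this. Note that this sketch assumes
that BasicBlock and MachineBasicBlock both derive from the hypothetical
BasicBlockBase, which is exactly the change under discussion:

    void analyzeBoth(llvm::Function &F, llvm::MachineFunction &MF) {
      CfgInfo InfoIR, InfoMIR;
      // The upcast to BasicBlockBase* is implicit; no template is
      // instantiated and no adapter is involved.
      InfoIR.compute(&F.getEntryBlock());
      InfoMIR.compute(&*MF.begin());
    }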

There are two main problems with the template approach:

1. Just for that one function "successors()", the type of the block has
   to be passed around as a template parameter. Starting from the entry
   point of the analysis, this has to go all the way to every leaf
   function that wants to call "successors()". This can especially be a
   problem if the template has other dependent types such as iterator
   adaptors (see the sketch after this list).

2. The actual analysis can be quite complicated, but because it's a
   template, it has to be in a header file so that it can be
   instantiated at the point where the analysis is actually computed.
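To illustrate the first problem, here is a contrived sketch with invented
helper names: the block type has to thread through every intermediate
helper, even the ones that only forward it:

    // Leaf helper: the only code that actually needs BlockType.
    template <typename BlockType>
    void collectEdgeStats(CfgInfo &Info, BlockType *B) {
      for (auto S : successors(B))
        Info.countEdge(B, S); // hypothetical bookkeeping
    }

    // Intermediate helper: a template solely to forward the type.
    template <typename BlockType>
    void visitNode(CfgInfo &Info, BlockType *B) {
      collectEdgeStats(Info, B);
    }

    // Entry point: also a template, and so is everything in between.
    template <typename BlockType>
    void CfgInfo::compute(BlockType *entry) {
      visitNode(*this, entry);
    }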

The CycleInfo patch demonstrates both these problems. There are multiple
files as follows:

GenericCycleInfo.h

This file defines the templates GenericCycleInfo<BlockType> and
GenericCycle<BlockType>. It also contains the entire implementation of
the analysis.

CycleInfo.h

- This file declares a thin wrapper CycleInfo around
  GenericCycleInfo<BasicBlock> and another wrapper Cycle around
  GenericCycle<BasicBlock>. These are the LLVM IR instances of the
  templates.
- The wrappers avoid including GenericCycleInfo.h by only needing the
  pointer type to the wrapped template instances.
- In particular, notice how the child_iterator explicitly exposes the
  details of the wrapped type, but then returns the wrapper with the `*`
  operator.

  using const_child_iterator_base =
    typename std::vector<std::unique_ptr<GenericCycle<BasicBlock>>>::const_iterator;
  struct const_child_iterator
      : iterator_adaptor_base<const_child_iterator, const_child_iterator_base> {
    using Base =
        iterator_adaptor_base<const_child_iterator, const_child_iterator_base>;

    Cycle operator *() const;
    const_child_iterator(const_child_iterator_base I) : Base(I) {}
  };

CycleInfo.cpp

This file is where the template is actually instantiated, along with the
implementation of the wrapper classes.
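For instance, the dereference operator declared in the header would be
implemented here along these lines (a sketch; it assumes a Cycle
constructor that wraps a GenericCycle<BasicBlock> pointer):

    Cycle const_child_iterator::operator*() const {
      // Unwrap the unique_ptr<GenericCycle<BasicBlock>> behind the vector
      // iterator and re-wrap it in the LLVM IR facade type.
      return Cycle(I->get());
    }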

A similar pair of files will be needed for MachineCycleInfo later. These
will be almost identical to the CycleInfo files, except for the name of
the block type and the overloads for "successors()" and "predecessors()".

All these files are just boilerplate designed to hide a template and
separately expose two instances of that template. Every type introduced
by the template needs to be carefully wrapped for use in other
files. Instead, if there were an interface (and not just a concept) that
exposed CFG edges, there would be just one CycleInfo.h file to declare
the analysis classes and one CycleInfo.cpp file to implement them.

So the argument goes somewhat like this:

1. Static polymorphism through templates creates a development overhead
   in the form of boilerplate, and also some runtime overhead of indirecting
   through wrapper types.

2. Dynamic polymorphism through an abstract base class or an abstract
   context class removes the boilerplate, but it introduces the runtime
   overhead of indirect calls. It also retains the runtime overhead of
   indirecting through opaque wrappers.

3. A non-abstract base class eliminates all the above overheads (for
   this specific problem of CFG traversal), but introduces a memory
   overhead because some information from the derived classes has to be
   duplicated in the base class.

The CycleInfo analysis exemplifies our motivation for moving away from
templates and traits. Assuming virtual calls are considered strongly
undesirable, that leaves us with the option of a non-abstract base
class.

Sameer.

Thank you for picking this up, Sameer, and thank you David for pinging some folks. As context for others, I’m currently on an extended parental leave and consciously not editing any code that would usually be work-related.

A few high-level remarks from what I remember. Let’s always keep in mind that there are a whole bunch of puzzle pieces in flight here, and there’s always a risk that the discussion gets off track by losing sight of that bigger picture. Some of those puzzle pieces can hopefully be non-controversial. In particular, I’m thinking here of:

  1. Aligning the different IR classes so that they implement common concepts and allow writing templates idiomatically.
  2. Allowing the use of opaque handles as described in the document.

An observation about the performance implications of common base classes for basic blocks. I’m taking it as a given that a common base class would have predecessor and successor vectors. In addition to making predecessor and successor iteration simpler, this could actually end up saving memory for LLVM IR in many cases if we make the following additional changes:

  1. Change the definition of phi nodes so that they only contain the predecessor values in their argument lists (removing all the arguments referring to the predecessor basic blocks).
  2. Change the definition of terminator instructions so that they no longer contain the successor basic block(s) in their argument lists and instead implicitly refer to the basic block’s successor list. For example, a conditional branch would be defined such that its only argument is the i1 condition. If the condition is true, the program branches to brInst->getParent()->getSuccessors()[0]; if the condition is false, it branches to brInst->getParent()->getSuccessors()[1]. A sketch of this follows below.
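A sketch of what the second change could look like for a conditional branch. Note that getSuccessors() on BasicBlock is an assumption of this sketch; no such accessor exists today:

    BasicBlock *BranchInst::getSuccessor(unsigned Idx) const {
      assert(getParent() && "a detached terminator has no successors");
      // The branch itself no longer carries block operands; the parent
      // block's successor list is the single source of truth.
      return getParent()->getSuccessors()[Idx];
    }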

This is an option that I hadn’t taken into account when I calculated those 3-4% of memory overhead, but it should end up saving memory whenever a basic block has >= 2 phi nodes (the vectors themselves have an overhead, so it depends). My experiment would have to be re-done to get a better idea of the final balance.

The changes mentioned above are admittedly very invasive and complicated changes. It would be good to get an idea of what sort of acceptance criterion the LLVM community has here.

Cheers,
Nicolai

Thanks, that was an excellent summary and the example was really helpful (at least to me)!
I’m fairly old school here I guess, as I’d have naively written an abstract base class with virtual methods (or function_ref callback into the implementation), but there are concerns with the indirection compared to a templated implementation.
But aren’t we trading the cost of this indirection for even more cost if we have to duplicate information in a base class?

Best,

Nicolai Hähnle writes:

> A few high-level remarks from what I remember. Let's always keep in mind
> that there are a whole bunch of puzzle pieces in flight here, and there's
> always a risk that the discussion gets off track by losing sight of that
> bigger picture. Some of those puzzle pieces can hopefully be
> non-controversial. In particular, I'm thinking here of:
>
> 1. Aligning the different IR classes so that they implement common concepts
> and allow writing templates idiomatically.
> 2. Allowing the use of opaque handles as described in the document.

I agree if you mean to say that the above two things are not mutually
exclusive, and both are beneficial. At least from my point of view,
templates over a common set of concepts are very valuable for writing
utility functions, but become cumbersome for writing large
analyses. Opaque handles (combined with a context class) can make it
easier to separate the IR-independent implementation from the
IR-dependent interface.

The context class provides a way to look up properties on the
opaque handle (such as the successors for a BlockHandle object) through
virtual functions. But having a non-abstract base class quickly reduces
the surface of these virtual functions, since the most common functions
like successors() are readily available in the base class itself.
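A minimal sketch of such a non-abstract base class, with assumed names:

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/ADT/SmallVector.h"

    // Hypothetical common base class. successors() and predecessors() are
    // plain member functions, so CFG traversal needs no virtual calls.
    class BasicBlockBase {
    protected:
      llvm::SmallVector<BasicBlockBase *, 2> Preds;
      llvm::SmallVector<BasicBlockBase *, 2> Succs;

    public:
      llvm::ArrayRef<BasicBlockBase *> predecessors() const { return Preds; }
      llvm::ArrayRef<BasicBlockBase *> successors() const { return Succs; }
    };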

That is related to what Mehdi asked in a sibling email:

> I'm fairly old school here I guess, as I'd have naively written an
> abstract base class with virtual methods (or function_ref callback
> into the implementation), but there are concerns with the indirection
> compared to a templated implementation.
> But aren't we trading the cost of this indirection for even more cost
> if we have to duplicate information in a base class?

Yes, we are avoiding indirect calls (not to be confused with mere
indirection through a data pointer). It's useful to ask whether it is
okay to have indirect calls (through virtual functions) when computing
an analysis, as opposed to when doing something that affects the IR
itself. Is it an option to let each analysis worry about its own
performance, perhaps by caching internal results of the virtual calls?
Presumably, using the results of an analysis or updating the analysis
will only incur an incremental cost from indirect calls?
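As an illustration of the caching idea, an analysis could hide the
virtual call behind a per-block cache along these lines (BlockHandle,
SsaContext::appendSuccessors(), and the surrounding analysis class are
all assumptions of this sketch):

    // Member of the (hypothetical) analysis class.
    llvm::DenseMap<BlockHandle, llvm::SmallVector<BlockHandle, 2>> SuccCache;

    llvm::ArrayRef<BlockHandle> cachedSuccessors(const SsaContext &Ctx,
                                                 BlockHandle B) {
      auto Res = SuccCache.try_emplace(B);
      if (Res.second)
        Ctx.appendSuccessors(B, Res.first->second); // indirect call, once per block
      return Res.first->second;
    }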

If we want to move away from templates, then dynamic polymorphism cannot
be completely avoided, for example when looking up the definition of a
value. But having a common base class can take big chunks out of the
surface area. The poster child for this is CFG traversal along the
successor/predecessor pointers.

> An observation about the performance implications of common base classes
> for basic blocks. I'm taking it as a given that a common base class would
> have predecessor and successor vectors. In addition to making predecessor
> and successor iteration simpler, this could actually end up _saving_ memory
> for LLVM IR in many cases if we make the following additional changes:
>
> 1. Change the definition of phi nodes so that they only contain the
> predecessor _values_ in their argument lists (removing all the arguments
> referring to the predecessor basic blocks).
> 2. Change the definition of terminator instructions so that they no longer
> contain the successor basic block(s) in their argument lists and instead
> implicitly refer to the basic block's successor list. For example, a
> conditional branch would be defined such that its only argument is the i1
> condition. If the condition is true, the program branches to
> brInst->getParent()->getSuccessors()[0]; if the condition is false, it
> branches to brInst->getParent()->getSuccessors()[1].
>
> This is an option that I hadn't taken into account when I calculated those
> 3-4% of memory overhead, but it should end up saving memory whenever a
> basic block has >= 2 phi nodes (the vectors themselves have an overhead, so
> it depends). My experiment would have to be re-done to get a better idea of
> the final balance.
>
> The changes mentioned above are admittedly very invasive and complicated
> changes. It would be good to get an idea of what sort of acceptance
> criterion the LLVM community has here.

Guess what, I had actually attempted this before reviving this thread!
My first experiment was to put an "assert(getParent())" in the
{get,set}Successor() methods. That didn't result in any interesting
failures. The real problem is with the terminator constructors. Usually
any Instruction::Create() sort of function optionally takes an
InsertBefore or InsertAtEnd argument. But if we require terminators to
access successors from the parent block, then variants of say
BranchInst::Create() that take successor pointers as arguments must now
require either an InsertBefore or an InsertAtEnd. That breaks all sorts
of code that manipulates the CFG.

Requiring a parent block when creating a terminator is a lot of work,
and I wasn't sure that the gain was worth the effort. It would require
refactoring many places that have been built around the assumption that
anyone can create an instruction in a vacuum. Instead, redefining
BasicBlock to not derive from the Value class seemed more practical
to me. It addresses the same overhead of redundantly tracking CFG
edges. For the special uses of llvm::BasicBlock as a value, we could
provide a new BasicBlockValue class. I tried commenting out the Value
base class in the BasicBlock declaration, but that experiment is too
big to try on my own, especially without some consensus forming in this
discussion.

Sameer.

Sameer Sahasrabuddhe writes:

> That is related to what Mehdi asked in a sibling email:
>
> > I'm fairly old school here I guess, as I'd have naively written an
> > abstract base class with virtual methods (or function_ref callback
> > into the implementation), but there are concerns with the indirection
> > compared to a templated implementation.
> > But aren't we trading the cost of this indirection for even more cost
> > if we have to duplicate information in a base class?
>
> Yes, we are avoiding indirect calls (not to be confused with mere
> indirection through a data pointer). It's useful to ask whether it is
> okay to have indirect calls (through virtual functions) when computing
> an analysis, as opposed to when doing something that affects the IR
> itself. Is it an option to let each analysis worry about its own
> performance, perhaps by caching internal results of the virtual calls?
> Presumably, using the results of an analysis or updating the analysis
> will only incur an incremental cost from indirect calls?

I did a little experiment with the bitcode sample files provided by
Jakub Kuderski [1], and mentioned in Nicolai's original proposal [2]:

[1] oss_bitcode - Google Drive
[2] [llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses

The experiment measures the impact of virtual function calls on the
cycleinfo analysis:

Static polymorphism: the version that uses templates to implement the same
analysis on LLVM IR and Machine IR.

Dynamic polymorphism: the version that uses the SsaContext with virtual
functions.

Total times for cycleinfo on LLVM IR are as follows:

Name                         Static   Dynamic   %-change
asm2wasm.bc                  0.0725   0.0758    4.5517%
clang-9.bc                   1.1039   1.1608    5.1545%
lld.bc                       0.5209   0.5442    4.4730%
llvm-as.bc                   0.1165   0.1220    4.7210%
llvm-dis.bc                  0.0529   0.0554    4.7259%
opt.bc                       0.4562   0.4789    4.9759%
rippled.bc                   0.4007   0.4188    4.5171%
sqlite3.bc                   0.0072   0.0077    6.9444%
wasm-as.bc                   0.0683   0.0713    4.3924%
wasm-ctor-eval.bc            0.0687   0.0723    5.2402%
wasm-dis.bc                  0.0681   0.0710    4.2584%
wasm-emscripten-finalize.bc  0.0680   0.0718    5.5882%
wasm-metadce.bc              0.0686   0.0724    5.5394%
wasm-opt.bc                  0.0720   0.0759    5.4167%
wasm-reduce.bc               0.0686   0.0725    5.6851%
wasm-shell.bc                0.0694   0.0728    4.8991%
wasm2js.bc                   0.0714   0.0751    5.1821%

On average, the version with virtual calls is slower by about 5%.

> > An observation about the performance implications of common base classes
> > for basic blocks. I'm taking it as a given that a common base class would
> > have predecessor and successor vectors. In addition to making predecessor
> > and successor iteration simpler, this could actually end up _saving_ memory
> > for LLVM IR in many cases if we make the following additional changes:
> >
> > 1. Change the definition of phi nodes so that they only contain the
> > predecessor _values_ in their argument lists (removing all the arguments
> > referring to the predecessor basic blocks).
> > 2. Change the definition of terminator instructions so that they no longer
> > contain the successor basic block(s) in their argument lists and instead
> > implicitly refer to the basic block's successor list. For example, a
> > conditional branch would be defined such that its only argument is the i1
> > condition. If the condition is true, the program branches to
> > brInst->getParent()->getSuccessors()[0]; if the condition is false, it
> > branches to brInst->getParent()->getSuccessors()[1].
> >
> > This is an option that I hadn't taken into account when I calculated those
> > 3-4% of memory overhead, but it should end up saving memory whenever a
> > basic block has >= 2 phi nodes (the vectors themselves have an overhead, so
> > it depends). My experiment would have to be re-done to get a better idea of
> > the final balance.
> >
> > The changes mentioned above are admittedly very invasive and complicated
> > changes. It would be good to get an idea of what sort of acceptance
> > criterion the LLVM community has here.

I made an initial attempt at scratching the surface of this. As an
experiment, I changed IRBuilder::CreateBr() to assert that a direct
branch can only be created at the end of a known basic block:

https://github.com/ssahasra/llvm-project/commit/2ea1deca2df736c91c1d00aa26528d7f14336bad

The entire patch is a set of fairly independent changes to various uses
of the IRBuilder. The most interesting changes are in SimplifyCFG.cpp,
but even there, some changes are independent. My overall impression is
that it is feasible to change how CFG edges are stored in LLVM IR.

The main problem with the BasicBlock is that the CFG edges are stored
in multiple places:
- Successors are stored in the terminator instruction.
- Predecessors are stored indirectly because the BasicBlock is also a
  Value, and the terminator in each predecessor is a Use.
- Predecessors are also stored in each PHI node.

All these copies will need to be eliminated as follows:

1. Require a parent basic block when accessing or modifying the CFG
   edges on a terminator or a PHI node.
2. Any constructor that takes a target block will require one of the
   following:
   a. An InsertAtEnd argument (parent block).
   b. Or, in the case of PHI, an InsertBefore argument. Inserting a
      terminator before another terminator will not be supported.
3. Disallow the removeFromParent() method for terminators. It may still
   be supported for a PHI, where the input Values can still be accessed
   without any knowledge of the incoming edges.
4. Potentially add a moveToNewParent() method.
5. Corresponding changes to IRBuilder. For example, CreateBr() will
   assert that the InsertPt is at the end of a valid basic block which
   does not already have a terminator, as sketched below.
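The assertion in item 5 might look roughly like this (a sketch; the
commit linked above shows what was actually tried):

    BranchInst *IRBuilderBase::CreateBr(BasicBlock *Dest) {
      // A direct branch may only be created at the end of a known block.
      assert(BB && "CreateBr requires a current insertion block");
      assert(!BB->getTerminator() && "insertion block is already terminated");
      return Insert(BranchInst::Create(Dest));
    }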

The BasicBlock will still be Use'd in a BlockAddress constant, but
that's a separate kind of use anyway.

This should not change bitcode in any way; it's only intended as a
change to in-memory IR. There may be some impact on the bitcode reader.

There's a good chance that I missed something important. But if all goes
well, most of this is just a long list of shallow changes, with some
changes being more involved than others.

Sameer.