RFC: Storing BB order in llvm::Instruction for faster local dominance

Hi folks,

I looked into some slow compiles and filed https://bugs.llvm.org/show_bug.cgi?id=38829. The gist of it is that we spend a lot of time iterating basic blocks to compute local dominance, i.e. given two instructions in the same BB, which comes first.

LLVM already has a tool, OrderedBasicBlock, which attempts to address this problem by building a lazy mapping from Instruction* to position. The problem is that cache invalidation is hard. If we don’t cache orderings at a high enough level, our transformations become O(n^2). If we cache them too much and insert instructions without renumbering the BB, we get miscompiles. My solution is to hook into the actual BB ilist modification methods, so that we can have greater confidence that our cache invalidation is correct.

I created a patch for this at https://reviews.llvm.org/D51664, which adds a lazily calculated position integer to every llvm::Instruction. I stole a bit from BasicBlock’s Value subclass data to indicate whether the orders are valid.
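To make the shape of the change concrete, here is a small self-contained sketch of the scheme using toy types (not the actual classes or the code in D51664): every instruction caches a position, the block tracks whether those positions are valid, and any list mutation flips the valid bit so the next query renumbers on demand.

```cpp
// Toy illustration of lazy per-instruction ordering; hypothetical types.
#include <cassert>
#include <list>

struct Block;

struct Inst {
  Block *Parent = nullptr;
  unsigned Order = 0;                  // lazily assigned position in Parent
  bool comesBefore(const Inst *Other) const;
};

struct Block {
  std::list<Inst *> Insts;
  bool OrderValid = false;             // the bit "stolen" from the block

  void push_back(Inst *I) {            // any list mutation invalidates
    I->Parent = this;
    Insts.push_back(I);
    OrderValid = false;
  }
  void renumber() {                    // O(n), amortized over many queries
    unsigned N = 0;
    for (Inst *I : Insts)
      I->Order = N++;
    OrderValid = true;
  }
};

bool Inst::comesBefore(const Inst *Other) const {
  assert(Parent && Parent == Other->Parent && "same block required");
  if (!Parent->OrderValid)
    Parent->renumber();
  return Order < Other->Order;         // O(1) once the numbering is valid
}
```

In the real patch the invalidation happens inside the BasicBlock ilist modification methods themselves, so transforms cannot forget to do it.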

Hopefully everyone agrees that this is a reasonable direction. I just figured I should announce this IR data structure change to the -dev list. :)

Reid

+1 to the general direction

My concern is the invalidation/recompute cost, but I think we can manage that.

Philip

Sounds great! -Hal

I haven’t had a chance to look at the patch in detail yet (hopefully this afternoon) but this sounds like a very invasive change to a core data structure.

The inner loop of the local dominance check in DominatorTree::dominates is also not very well implemented: it does a single linear pass from the beginning of the block until it finds the def or user. A better algorithm would be to use two pointers: one at the user and one at the def. Each time through the loop, move the user iterator “up” the block and the def iterator “down” the block. Either the iterators meet each other (in which case return true) or you find the beginning/end of the block.

This should work a lot better for many queries, because it will be efficient when the user and def are close to each other, as well as being efficient when the value is at the end of the block. Also, my bet is that most local dom queries return true.
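For reference, a minimal sketch of that bidirectional scan (a hypothetical helper, not a patch to the existing DominatorTree code): the cost is proportional to the distance between the two instructions, or to the nearer block boundary, rather than to the block size.

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Returns true if Def appears strictly before User in their common block.
static bool comesBeforeBidirectional(const Instruction *Def,
                                     const Instruction *User) {
  const BasicBlock *BB = Def->getParent();
  assert(BB == User->getParent() && "expected a local query");
  if (Def == User)
    return false;

  auto Down = Def->getIterator();   // walks toward the end of the block
  auto Up = User->getIterator();    // walks toward the start of the block
  auto Begin = BB->begin(), End = BB->end();
  while (true) {
    // Step forward from Def: reaching User proves Def comes first,
    // running off the end proves it does not.
    if (++Down == End)
      return false;
    if (&*Down == User)
      return true;
    // Step backward from User: reaching Def proves Def comes first,
    // running off the front proves it does not.
    if (Up == Begin)
      return false;
    --Up;
    if (&*Up == Def)
      return true;
  }
}
```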

Have you tried this approach? It should be very easy to hack up to try out on your use case.

-Chris

Indeed. Perhaps a long-overdue one ;)

This seems like a good idea. It doesn’t change the fact, however, that local dominance queries are O(n). We’ve ended up using OrderedBasicBlock in an increasing number of places, but there are a number of places where this is hard because of the plumbing required, or, more importantly, the ambiguity around who owns the state of the cache at any given time. We know that there are a significant number of additional places where we should be using something like OrderedBasicBlock, but adding OBB into many of these places would be quite non-trivial. As indicated by the performance results briefly described in D51664, we have significant headroom.

I don’t see any really feasible way around these issues except moving the ownership of that state into the BB itself.

Thanks again, Hal

To echo what Hal said, yes, it’s a major change, but I think the improved complexity guarantees, simplicity, and elimination of certain classes of bugs are worth it.

I think we have consensus that we should go forward with this. Would anyone mind formally stamping it in phab? So far everyone understandably has said “makes sense to me, but I don’t consider myself to have authority to stamp this.”

Did you consider using the waymarking algorithm we already use for
going from Use->User to store the offset of an instruction in a basic
block? We could steal the two bits needed from the bb parent pointer
in the instruction.

-- Sanjoy

> Did you consider using the waymarking algorithm we already use for
> going from Use->User to store the offset of an instruction in a basic
> block? We could steal the two bits needed from the bb parent pointer
> in the instruction.

Actually, now that I think of it, in llvm::Instruction we have 6 bits
to play with, so we should be able to create a more efficient scheme than
the one used in llvm::Use (well, we have 6 bits even in llvm::Use but
efficiency is probably not as important there because Use lives in an
array, not a linked list).

-- Sanjoy

I hadn’t considered it, and it’s an interesting idea. I’d rather not pursue it because I think stealing bits from ilist just for Instruction linked lists is going to be difficult. It also has runtime costs for linked list iteration, so it’s not completely free. I’d rather get the speed win, spend the memory, and then try to win back the space if it seems like a priority.

We don't need to steal from ilist Prev/Next if we're okay with the
vanilla two-bits-per-node waymarking (which we can get from the parent
BasicBlock pointer), which may be good enough since it will reduce the
number of hops needed for local dominance from O(n) to O(log n). I
was kind of hoping that it would be straightforward to pull the
waymarking code out of Value and reuse it; but if you'd prefer not to
do that now, I'm fine with it.

-- Sanjoy

Hi Reid,

I have significant concerns with this patch. I’d really appreciate it if you could address the discussion feedback here before just plowing forward with such an invasive change to the core IR.

Have you done any memory use analysis or compile time impact analysis of this change? Have you considered the algorithm I mentioned? I am not motivated by your rationale above, and I think that approach would be a huge step forward and effectively define away the problem.

-Chris

To be clear, I’m a superfan of your work and agree that “automatic invalidation and simplification” are good goals. I’d just love to see more analysis of the situation before burning space on this extra field.

For example, it would be very interesting to see what the typical distribution of local dominance queries actually are. You could take your patch here and have it instrument the “Instruction::comesBefore” method to log the distance between the instruction numbers to a file and generate a histogram of them. My bet is that the vast vast vast majority of them will be small (and I’d guess a HUGE number of them have a distance of +1/-1, whose query can be resolved without computing a numbering in the first place).
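A rough sketch of that instrumentation (a hypothetical helper, not the actual logging patch): answer the +1/-1 cases from the ilist links, and log the distance for everything else so it can be bucketed offline.

```cpp
#include <cassert>
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Hypothetical instrumented comparison, not the actual patch.
static bool comesBeforeLogged(const Instruction *A, const Instruction *B) {
  assert(A->getParent() == B->getParent() && "expected a local query");
  // Fast path: adjacent instructions need no numbering at all.
  if (A->getNextNode() == B)
    return true;
  if (A->getPrevNode() == B)
    return false;

  // Slow path: compute both positions with one walk over the block.
  unsigned PosA = 0, PosB = 0, N = 0;
  for (const Instruction &I : *A->getParent()) {
    if (&I == A) PosA = N;
    if (&I == B) PosB = N;
    ++N;
  }
  unsigned Dist = PosA < PosB ? PosB - PosA : PosA - PosB;
  errs() << "local-dominance distance: " << Dist << "\n"; // bucket offline
  return PosA < PosB;
}
```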

Can you do that experiment and report back, e.g. on the SemaChecking.cpp testcase?

-Chris

>> I think we have consensus that we should go forward with this. Would anyone mind formally stamping it in phab? So far everyone understandably has said “makes sense to me, but I don’t consider myself to have authority to stamp this.”

> Hi Reid,
>
> I have significant concerns with this patch. I’d really appreciate it if you could address the discussion feedback here before just plowing forward with such an invasive change to the core IR.

Sorry for that, I’m at a conference and I’m trying to push this along in the background without doing any real work on it. I’ve convinced myself it’s the right change, and that’s enough to make me impatient.

> Have you done any memory use analysis or compile time impact analysis of this change? Have you considered the algorithm I mentioned? I am not motivated by your rationale above, and I think that approach would be a huge step forward and effectively define away the problem.

> To be clear, I’m a superfan of your work and agree that “automatic invalidation and simplification” are good goals. I’d just love to see more analysis of the situation before burning space on this extra field.
>
> For example, it would be very interesting to see what the typical distribution of local dominance queries actually are. You could take your patch here and have it instrument the “Instruction::comesBefore” method to log the distance between the instruction numbers to a file and generate a histogram of them. My bet is that the vast vast vast majority of them will be small (and I’d guess a HUGE number of them have a distance of +1/-1, whose query can be resolved without computing a numbering in the first place).
>
> Can you do that experiment and report back, e.g. on the SemaChecking.cpp testcase?

Fair enough. I made a patch to do that, and uploaded it here: https://reviews.llvm.org/D52510
I put it into a spreadsheet and made a histogram here: https://goo.gl/AMLBt7
I grouped the typical query distances into natural log buckets, and the table is small enough to paste here:

distance freq
1-2 3707
2-7 5622
7-20 25764
20-54 1726
54-148 2052
148-403 4462
403-1096 8423
1096-2980 20487
2980-8103 34930
8103-22026 63759
22026-59874 41729

I think the table shows that local dominance queries are, at least for SemaChecking.cpp, not clustered around small distances. The average query is somewhere out around 11,000 instructions apart.

One thing to note is that the code in question isn’t using DominatorTree::dominates. That was identified as a hotspot, so it’s using OrderedBasicBlock, but it’s not caching it at a high enough level to avoid quadratic behavior. Raising it up isn’t easy, hence this patch.

Sanjoy also wanted more data and suggested that I re-run the max RSS during full LTO of llc experiment that I did when I removed the vptr from Value. I went ahead and did that on the review, but it’s not on the dev list, so it may not have enough visibility. I found that the max RSS before the patch was 4816464 kb, and after, 4867836 kb. That’s an increase of 1.06%, or in absolute terms, ~50MB of memory spent on instruction positions during full LTO of llc.

I think Sanjoy’s bit-stealing suggestion is a worthwhile alternative to growing Instruction, but I really think we should analyze it from the position of, is the complexity of bit-stealing from Parent worth it to shrink Instruction?

Anyway, hopefully this looks compelling. I’ll put some of it in the patch description before landing.

> I think Sanjoy's bit-stealing suggestion is a worthwhile alternative to growing Instruction, but I really think we should analyze it from the position of, is the complexity of bit-stealing from Parent worth it to shrink Instruction?

Let's not assume a dichotomy between storing an int64 in
llvm::Instruction and bitwise tricks -- they're both ways of caching
position within llvm::Instruction. I think we first need to establish
that we need such a cache in llvm::Instruction/llvm::BasicBlock at
all.

Do you have a sense of where these expensive domtree queries are
coming from? Are they from a couple of key places or are they
scattered throughout the pass pipeline?

-- Sanjoy

When I dug into the profile with WPA, most of the time was spent in DSE and memcpyopt, which call AAResults::callCapturesBefore, which calls llvm::PointerMayBeCapturedBefore. Neither passes in an OrderedBasicBlock, so they rebuild the OrderedBasicBlock in linear time on every query. These passes insert instructions, so it’s not correct to simply create and reuse an OrderedBasicBlock at this level.
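For illustration, the problematic shape is roughly the following (the loop and the two helpers are made up, not the actual DSE/memcpyopt code): each query rebuilds the ordering, so K queries over a block of N instructions cost O(K*N), and the insertions are what prevent hoisting a single OrderedBasicBlock above the loop.

```cpp
#include "llvm/Analysis/OrderedBasicBlock.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Made-up stand-ins for the real analysis entry points, just to show the
// shape of the loop; they are not actual LLVM APIs.
bool isInterestingMemoryOp(const Instruction *I);
bool mayBeCapturedBefore(const Value *Ptr, const Instruction *I,
                         OrderedBasicBlock &OBB);

// Hypothetical sketch of the quadratic pattern.
void scanBlock(BasicBlock &BB, const Value *Ptr) {
  for (Instruction &I : BB) {
    if (!isInterestingMemoryOp(&I))
      continue;
    // Rebuilt for every query because nothing longer-lived owns it; this is
    // effectively what happens inside callCapturesBefore and
    // PointerMayBeCapturedBefore when no OrderedBasicBlock is passed in.
    OrderedBasicBlock OBB(&BB);
    if (mayBeCapturedBefore(Ptr, &I, OBB))
      continue;
    // ... transform, possibly inserting instructions, which is why a single
    // OrderedBasicBlock cannot simply be hoisted above the loop ...
  }
}
```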

As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.

These key call sites aside, dominance seems like a pretty common query. We have several somewhat overlapping mechanisms for checking dominance:

  • DominatorTree::dominates(Instruction*,Instruction*), the original interface, widely known as a bottleneck

  • OrderedBasicBlock, as used by memdep, my motivating example

  • OrderedInstructions, which uses OrderedBasicBlock, used by GVN and others

  • MemorySSA has a similar ordering of memory values that is also invalidated when inserting memory defs. I may have misunderstood how this works, though.

After thinking more about this over the day, I thought about the bugs we have when the cache isn’t on the BasicBlock or Instruction. We have one here: https://reviews.llvm.org/D50377

I would say that, while the expensive queries come from just a few passes (DSE & memcpyopt), OrderedBasicBlock is widely used by more than just a few transforms. Putting aside performance, the existence of all those OrderedBasicBlock caches tells me that yes, we do need to cache this info, and it is widely used. Maybe it doesn’t need to live directly on Instruction, but the cache should be maintained by BasicBlock.

> Let’s not assume a dichotomy between storing an int64 in
> llvm::Instruction and bitwise tricks – they’re both ways of caching
> position within llvm::Instruction. I think we first need to establish
> that we need such a cache in llvm::Instruction/llvm::BasicBlock at
> all.
>
> Do you have a sense of where these expensive domtree queries are
> coming from? Are they from a couple of key places or are they
> scattered throughout the pass pipeline?

+1.

> When I dug into the profile with WPA, most of the time was spent in DSE and memcpyopt, which call AAResults::callCapturesBefore, which calls llvm::PointerMayBeCapturedBefore. Neither passes in an OrderedBasicBlock, so they rebuild the OrderedBasicBlock in linear time on every query. These passes insert instructions, so it’s not correct to simply create and reuse an OrderedBasicBlock at this level.

So this is one of the reasons I find your patch to be problematic: it isn’t fixing N^2 behavior, it is merely trading one N^2 situation for another. AFAICT these sorts of transformations fall into one of two cases:

  1. These transformations are iteratively removing or inserting instructions, which invalidate the ordering with your approach, causing subsequent queries to rebuild the equivalent of OrderedBasicBlock.

  2. These transformations are not doing anything, in which case this is all wasted work.

As a next step, can you please instrument the calls from DSE and/or memcpyopt and see how many of the ones for the huge basic blocks actually turn into a transformation in the code? If there are zero, then we should just disable these optimizations for large blocks. If there are a few improvements, then we should see how to characterize them and what the right solution is based on the sorts of information they need.

LLVM is occasionally maligned for the magic numbers like “6” in various optimizations that bound the work in various analyses (e.g. when computing masked bits), but they exist for very real reasons: the cases in which a large search will actually lead to benefits are marginal to non-existent, and the possibility of unbounded queries through core APIs causes all kinds of problems.

I see the dependence analysis queries as the same sort of situation: until something like MemorySSA is able to define away these dense queries, we should be using bounded searches (in this case larger than 6 but less than 11,000 :-).

It would be straight-forward to change llvm::BasicBlock to keep track of the number of instructions in it (doing so would not change any ilist algorithmic behavior that I’m aware of given the need to keep instruction parent pointers updated), and having that info would make it easy to cap linear passes like this or switch into local search modes.
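A cap of that sort could be as simple as the following sketch (the constant is made up, and size() is assumed to be cheap because the block would maintain the count, as described above):

```cpp
#include "llvm/IR/BasicBlock.h"
using namespace llvm;

// Hypothetical cutoff, not existing code: bail out of an expensive linear
// search on huge blocks and give a conservative answer instead.
static bool blockTooBigForLinearScan(const BasicBlock &BB) {
  constexpr unsigned MaxBlockSize = 1000; // made-up tuning constant
  return BB.size() > MaxBlockSize;        // assumed O(1) with a cached count
}
```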

The cost of such a thing is the risk of performance regressions. We control that risk by doing some analysis of the payoff of these transformations on large blocks - if the payoff doesn’t exist then it is a simple answer. The real question here is not whether DSE is able to eliminate a single store, it is whether eliminating that store actually leads to a measurable performance improvement in the generated code. Given huge blocks, I suspect the answer is “definitely not”.

> As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.

Yes, understood, that would be a major change. That said, changing the entire architecture of the compiler to work around this isn’t really appealing to me, I’d rather limit the problematic optimizations on the crazy large cases.

-Chris

> So this is one of the reasons I find your patch to be problematic: it isn’t fixing N^2 behavior, it is merely trading one N^2 situation for another. AFAICT these sorts of transformations fall into one of two cases:
>
> 1. These transformations are iteratively removing or inserting instructions, which invalidate the ordering with your approach, causing subsequent queries to rebuild the equivalent of OrderedBasicBlock.

I would say that this code fixes the quadratic behavior in practice, and lays the foundation to fix it permanently if it becomes a problem.

First, removing instructions doesn’t invalidate the ordering, and these passes mostly delete instructions, so in practice, the latent quadratic behavior is very hard to exercise.

In the future, if profiling shows that “insert;query;insert;query” is a bottleneck, we can fix it by leaving gaps in the numbering space and assigning new numbers halfway between adjacent instructions, renumbering as required to maintain amortized O(1) insertion complexity. I think it’s unlikely that we will ever need to implement this fancy renumbering algorithm, and people on the bug agree (https://llvm.org/pr38829#c2). Existing code that uses OrderedBasicBlock already doesn’t implement this algorithm. According to existing code, invalidating on modification is good enough.
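For concreteness, a sketch of that gap-numbering idea, reusing the toy Inst/Block types from the sketch earlier in the thread (the constants and helpers are made up; nothing like this is in the current patch):

```cpp
#include <algorithm>
#include <iterator>

constexpr unsigned Gap = 1u << 10;         // made-up spacing between orders

void renumberWithGaps(Block &B) {          // O(n); restores large gaps
  unsigned N = Gap;
  for (Inst *I : B.Insts) {
    I->Order = N;
    N += Gap;
  }
  B.OrderValid = true;
}

// Insert NewI immediately after Pos, giving it a number strictly between its
// neighbors; renumber only when the neighboring numbers are adjacent. (A real
// implementation would take an iterator rather than searching for Pos.)
void insertAfter(Block &B, Inst *Pos, Inst *NewI) {
  if (!B.OrderValid)
    renumberWithGaps(B);                   // make the neighbor numbers usable
  auto It = std::find(B.Insts.begin(), B.Insts.end(), Pos);
  auto NextIt = std::next(It);
  unsigned Lo = Pos->Order;
  unsigned Hi = (NextIt == B.Insts.end()) ? Lo + 2 * Gap : (*NextIt)->Order;
  NewI->Parent = &B;
  B.Insts.insert(NextIt, NewI);
  if (Hi - Lo > 1)
    NewI->Order = Lo + (Hi - Lo) / 2;      // midpoint; ordering stays valid
  else
    renumberWithGaps(B);                   // rare when the gaps are large
}
```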

> 2. These transformations are not doing anything, in which case this is all wasted work.
>
> As a next step, can you please instrument the calls from DSE and/or memcpyopt and see how many of the ones for the huge basic blocks actually turn into a transformation in the code? If there are zero, then we should just disable these optimizations for large blocks. If there are a few improvements, then we should see how to characterize them and what the right solution is based on the sorts of information they need.
>
> LLVM is occasionally maligned for the magic numbers like “6” in various optimizations that bound the work in various analyses (e.g. when computing masked bits), but they exist for very real reasons: the cases in which a large search will actually lead to benefits are marginal to non-existent, and the possibility of unbounded queries through core APIs causes all kinds of problems.
>
> I see the dependence analysis queries as the same sort of situation: until something like MemorySSA is able to define away these dense queries, we should be using bounded searches (in this case larger than 6 but less than 11,000 :-).
>
> It would be straight-forward to change llvm::BasicBlock to keep track of the number of instructions in it (doing so would not change any ilist algorithmic behavior that I’m aware of given the need to keep instruction parent pointers updated), and having that info would make it easy to cap linear passes like this or switch into local search modes.
>
> The cost of such a thing is the risk of performance regressions. We control that risk by doing some analysis of the payoff of these transformations on large blocks - if the payoff doesn’t exist then it is a simple answer. The real question here is not whether DSE is able to eliminate a single store, it is whether eliminating that store actually leads to a measurable performance improvement in the generated code. Given huge blocks, I suspect the answer is “definitely not”.

>> As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.
>
> Yes, understood, that would be a major change. That said, changing the entire architecture of the compiler to work around this isn’t really appealing to me, I’d rather limit the problematic optimizations on the crazy large cases.

You’re probably right, these passes probably aren’t actually firing. But, it sounds like we have two options:

  1. Add cutoffs to poorly understood slow passes that are misbehaving
  2. Make a core data structure change to make dominance calculations faster and simpler for all transforms

I can do this analysis, and add these cutoffs, but I wouldn’t feel very good about it. It adds code complexity, we’ll have to test it, and tomorrow someone will add new quadratic dominance queries. I don’t see how heuristically limiting misbehaving optimizations builds a better foundation for LLVM tomorrow.

Dominance has been and will always be an important analysis in LLVM. This patch is basically pushing an analysis cache down into IR. Would it be accurate to say that your objection to this approach has more to do with analysis data living in IR data structures?

If I added more infrastructure to invert the dependency, store these instruction numbers in DominatorTree, and make BasicBlock notify DominatorTree of instruction insertion and deletion, would that be preferable? That’s very similar to the code we’re writing today, where the transform takes responsibility for marrying its modifications to its invalidations. The question is, do we want to keep this stuff up to date automatically, like we do for use lists, or is this something we want to maintain on the side? I think dominance is something we might want to start pushing down into IR.

>>> As suggested in the bug, if we were to rewrite these passes to use MemorySSA, this bottleneck would go away. I rebased a patch to do that for DSE, but finishing it off and enabling it by default is probably out of scope for me.
>>
>> Yes, understood, that would be a major change. That said, changing the entire architecture of the compiler to work around this isn’t really appealing to me, I’d rather limit the problematic optimizations on the crazy large cases.
>
> You’re probably right, these passes probably aren’t actually firing. But, it sounds like we have two options:
>
> 1. Add cutoffs to poorly understood slow passes that are misbehaving
> 2. Make a core data structure change to make dominance calculations faster and simpler for all transforms
>
> I can do this analysis, and add these cutoffs, but I wouldn’t feel very good about it. It adds code complexity, we’ll have to test it, and tomorrow someone will add new quadratic dominance queries. I don’t see how heuristically limiting misbehaving optimizations builds a better foundation for LLVM tomorrow.

My answer to this is that the long term path is to move these passes to MemorySSA, which doesn’t have these problems. At which point, the core IR change you are proposing becomes really the wrong thing.

> Dominance has been and will always be an important analysis in LLVM. This patch is basically pushing an analysis cache down into IR. Would it be accurate to say that your objection to this approach has more to do with analysis data living in IR data structures?
>
> If I added more infrastructure to invert the dependency, store these instruction numbers in DominatorTree, and make BasicBlock notify DominatorTree of instruction insertion and deletion, would that be preferable? That’s very similar to the code we’re writing today, where the transform takes responsibility for marrying its modifications to its invalidations. The question is, do we want to keep this stuff up to date automatically, like we do for use lists, or is this something we want to maintain on the side? I think dominance is something we might want to start pushing down into IR.

I have several objections to caches like this, and we have been through many failed attempts at making analyses “autoadapt” to loosely coupled transformations (e.g. the CallbackVH stuff, sigh). My objections are things like:

  1. Caching and invalidation are very difficult to get right.
  2. Invalidation logic slows down everyone even if they aren’t using the cache (your “check to see if I need to invalidate” when permuting the ilist).
  3. We try very hard not to put analysis/xform specific gunk into core IR types, because it punishes everyone but benefits only a few passes.
  4. LLVM is used for a LOT of widely varying use cases - clang is just one client, and many of them don’t run these passes. This change pessimizes all those clients, particularly the most sensitive 32-bit systems.
  5. If done wrong, these caches can break invariants that LLVM optimization passes are supposed to maintain. For example, if you do “sparse numbering” then the actual numbering will depend on the series of transformations that happen, and if someone accidentally uses the numbers in the wrong way, you could make “opt -pass1 -pass2” behave differently than “opt -pass1 | opt -pass2”.

Putting this stuff into DominatorTree could make sense, but then clients of dominator tree would have to invalidate this when doing transformations. If you put the invalidation logic into ilist, then many of the problems above reoccur. I don’t think (but don’t know) that it is unreasonable to make all clients of DT invalidate this manually.

-Chris

Maybe I’m missing something, but why do you say this? MemorySSA certainly collects the memory references in a basic block into a set of use/def chains, but determining dominance still requires walking those chains. This might help reduce the constant factor on the O(N), because it skips the non-memory-access instructions, but the underlying complexity problem remains. Maybe MemorySSA should cache a local numbering, but…

MemorySSA only helps if you’re fine with skipping non-aliasing accesses. It’s not clear to me that this is always the case. For example, imagine that we’re trying to do something like SLP vectorization, and so we follow the use-def chain of an address calculation and find two adjacent, but non-aliasing, loads in the same basic block. We want to convert them into a vector load, so we need to place the new vector load at the position of the first (dominating) load. MemorySSA will help determine legality of the transformation, but MemorySSA won’t help determine the dominance, because it will have the MemorySSA nodes for both loads tied, potentially, to the same MemorySSA definition node (i.e., you can’t walk from one to the other, by design, which is why it helps with the legality determination). Dominance is a fundamental construct in an SSA-form IR.

I certainly agree with you in general, but I’m not convinced by this reasoning in this case. I don’t see why this has anything to do with Clang. As I understood it, the self-host compile time of a large Clang source file was being used only as an example. This change may very well help many clients.

Thanks again, Hal