[RFC] Simple GVN hoist

I've been looking at a particularly hot function in SPEC/x264 that LLVM fails
to vectorise.

    typedef short int16_t;
    typedef unsigned short uint16_t;

    int q(int16_t *d, uint16_t *m, uint16_t *b) {
      int n = 0;
      for (int i = 0; i < 16; i++) {
        if (d[i] > 0)
          d[i] = (b[i] + d[i]) * m[i] >> 16;
        else
          d[i] = -((b[i] - d[i]) * m[i] >> 16);
        n |= d[i];
      }
      return n;
    }

As it turns out, LLVM adds runtime alias checks for this function and then
considers it legal to vectorise with if-conversion. However, the vectorisation
cost model effectively bans that particular if-conversion by assigning a
prohibitively high cost to emulated masked loads and stores.

Originally, each branch of the if statement in the loop body contains three
identical loads. Manually hoisting these allows the loop to be vectorised at
`-O2` (at `-O3` the loop is fully unrolled, which breaks vectorisation).
There's a subroutine in `SimplifyCFG` that does rudimentary hoisting of
instructions from the two successors of a block; it does indeed hoist two of
the three loads, but gives up before reaching the third one.
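
For reference, here is roughly the shape of the manually hoisted version (my
rewrite, not the original x264 code), with all three loads done before the
branch:

    int q_hoisted(int16_t *d, uint16_t *m, uint16_t *b) {
      int n = 0;
      for (int i = 0; i < 16; i++) {
        /* All three loads hoisted above the branch. */
        int di = d[i], bi = b[i], mi = m[i];
        if (di > 0)
          d[i] = (bi + di) * mi >> 16;
        else
          d[i] = -((bi - di) * mi >> 16);
        n |= d[i];
      }
      return n;
    }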

We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist`
can do that, but that pass is disabled by default and has been disabled for a
long, long time.

As an alternative, I was thinking of a simpler hoisting transformation that
just handles moving instructions from two single-predecessor blocks to their
common predecessor. That could be made reasonably fast by pairing instructions
by their GVN value number. Limiting hoisting to the immediate predecessor block
would also, for the most part, avoid excessive increases in value lifetimes,
and it would simplify the correctness checks.
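
In outline, the transformation would look something like the sketch below
(illustration only, with operand availability and all the safety checks
elided; `lookupVN` stands in for GVN's value-number lookup):

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/STLExtras.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    uint32_t lookupVN(Instruction *I); // stand-in for GVN's value numbering

    // Sketch: hoist pairs of instructions with equal value numbers from the
    // two single-predecessor successors of BB into BB itself.
    void simpleGVNHoist(BasicBlock *BB) {
      auto *Br = dyn_cast<BranchInst>(BB->getTerminator());
      if (!Br || !Br->isConditional())
        return;
      BasicBlock *Then = Br->getSuccessor(0), *Else = Br->getSuccessor(1);
      if (Then == Else || !Then->getSinglePredecessor() ||
          !Else->getSinglePredecessor())
        return;
      // Map value number -> instruction for one successor ...
      DenseMap<uint32_t, Instruction *> VNMap;
      for (Instruction &I : *Then)
        if (!I.isTerminator())
          VNMap.try_emplace(lookupVN(&I), &I);
      // ... and scan the other successor for matching value numbers.
      for (Instruction &I : make_early_inc_range(*Else)) {
        auto It = VNMap.find(lookupVN(&I));
        if (It == VNMap.end() || I.isTerminator())
          continue;
        Instruction *Twin = It->second;
        Twin->moveBefore(Br);       // hoist one copy into the predecessor,
        I.replaceAllUsesWith(Twin); // then CSE away the other
        I.eraseFromParent();
      }
    }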

I've written such a transformation as a subroutine of `GVN`; it seemed like a
good place for it, and it is in a similar spirit to the various PRE
transformations GVN already does. The Phabricator review is at
https://reviews.llvm.org/D109760 ([GVN] Simple GVN hoist).

Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):

500.perlbench_r   1.13%
502.gcc_r         0.00%
505.mcf_r        -1.89%
520.omnetpp_r     0.00%
523.xalancbmk_r   0.00%
525.x264_r        7.67%
531.deepsjeng_r   0.60%
541.leela_r       0.24%
548.exchange2_r   0.00%
557.xz_r          0.75%

Comments?

~chill

Initial response focused on different potential ways of solving this problem. I’ll reply separately on your proposed GVN patch.

I see a number of ways to tackle this:

  • The vectorizer already has to do a forward walk through the function to compute predicate masks. As we do that walk, we could track, for each address, the union of the predicate masks of its accesses. In this case, the union would be all ones, meaning we could use an unconditional access. In general, we can do a single predicated access using the union of the predicate masks. There is a non-trivial costing question of whether explicitly forming the union is worthwhile, but I suspect we can handle obvious cases cheaply.
  • Still in the vectorizer, we can do a forward walk and track accesses encountered along each path. Any address which is accessed along each path to the latch can be done unconditionally. (This is essentially a restricted subset of the former without the generalization to predicate masks.)
  • We could extend SimplifyCFG. You mention this, but don’t get into why we don’t handle the last load. We really should in this case. (Though, after CSE, there should only be two conditional loads in your inner loop? Maybe I’m missing something?)
  • You don’t mention your target processor, but one question to ask is whether the cost model for a predicated load is reasonable. If not, would reducing it to match the actual target cost fix your problem? In particular, we have multiple memory accesses with the same mask here. Does accounting for that difference in lowering cost sidestep the issue?

Your in-passing mention that -O3 and unrolling breaks vectorization also concerns me. It really shouldn’t. That sounds like a probable issue in the SLP vectorizer, and maybe a pass ordering issue.

All of the above is simply me brainstorming. :)

Philip

Adding to Philip’s list and ideas:

  • Fix GVNHoist. I like your patch because it’s small and concise, but missing from the RFC is a discussion of why we don’t try to re-enable GVNHoist. I know it was briefly enabled by default, which was reverted due to correctness (or was it regressions?) problems. But if this belongs in GVNHoist, could it, for example, be added to GVNHoist with only this part enabled? Not sure if that’s possible as I haven’t looked at GVNHoist.

Removing from Philip’s list:

  • SimplifyCFG, because it’s already a monster. ;) More seriously, I’m not sure it’s the right place; GVNHoist looks better.

Ok, this response is focused on the proposed transform. See inline.

This looks generally positive. Do you know why MCF regresses? If not, taking a glance to make sure you understand the interaction is probably warranted.

Realized that the detailed nature of my response may have missed the high order bit. I generally think that doing this transformation in GVN is entirely reasonable, and would be a useful enhancement. That’s why I spent the time to give the detailed feedback on implementation.

Philip

p.s. It’s worth noting that what we’re implementing here is a restricted form of “very busy expressions”, which is basically the dual of PRE. There’s a bunch of existing literature on the general problem.

p.p.s. Once we’re through with the hoisting variant, there’s a restricted sinking variant too. If we structure the hoisting code right, the sinking should be straightforward.

The vectorizer already has to do a forward walk through the function to compute predicate masks. As we do that walk, we could track, for each address, the union of the predicate masks of its accesses. In this case, the union would be all ones, meaning we could use an unconditional access. In general, we can do a single predicated access using the union of the predicate masks. There is a non-trivial costing question of whether explicitly forming the union is worthwhile, but I suspect we can handle obvious cases cheaply.
Still in the vectorizer, we can do a forward walk and track accesses encountered along each path. Any address which is accessed along each path to the latch can be done unconditionally. (This is essentially a restricted subset of the former without the generalization to predicate masks.)
You don't mention your target processor, but one question to ask is whether the cost model for a predicated load is reasonable. If not, would reducing it to match the actual target cost fix your problem? In particular, we have *multiple* memory accesses with the *same* mask here. Does accounting for that difference in lowering cost sidestep the issue?

The target architecture is AArch64, and the cost model is indeed
entirely unreasonable: in fact, it deliberately sets a high value for
emulated masked loads/stores so as to disable vectorisation.
For the case at hand, though, I set out to find a way to not
have to deal with masked loads/stores in the first place, instead of
handling them better (by fixing the cost model or otherwise).

We could extend SimplifyCFG. You mention this, but don't get into *why* we don't handle the last load. We really should in this case. (Though, after CSE, there should only be two conditional loads in your inner loop? Maybe I'm missing something?)

Two of the loads are hoisted by
`SimplifyCFGOpt::HoistThenElseCodeToIf`, which is intentionally
limited to hoisting identical instructions in identical order. In this
example, the scan of the two blocks
stops at the first pair of differing instructions, which happens to be
before the third load.

Your in-passing mention that -O3 and unrolling breaks vectorization also concerns me. It really shouldn't. That sounds like a probable issue in the SLP vectorizer, and maybe a pass ordering issue.

I would (maybe naively) think that the loop vectoriser should be given a
chance before loop unrolling and the SLP vectoriser.

Yes, GVNHoist will do what we need

[hit send too soon]

So, GVNHoist will do the hoisting we want. However, while technically
possible, there are a few drawbacks to taking the existing GVNHoist
approach:
* GVNHoist is large and algorithmically complex
* Even if enabled, it won't solve this specific issue - after hoisting
of the loads, the loop size falls below some unrolling threshold and
the loop is fully unrolled, whereas IMHO the loop should be
loop-vectorised - so we get a pass ordering issue.
* After unrolling, the SLP vectoriser does not vectorise the loop - so
we have to fix the SLP vectoriser too.

While I think all of these are worthwhile, on one hand we have a
rather simple transformation (the mini-GVNHoist), vs., on the other
hand, the actual GVNHoist + pass manager + SLP vectoriser, which
could easily turn into a time sink with no end in sight.

The SLP vectoriser will fail to vectorise this because it wasn’t taught to emit runtime alias analysis checks. This is being addressed by Florian in https://reviews.llvm.org/D102834. We had many issues with full unrolling removing opportunities for the loop vectoriser, and with the SLP missing some tricks. But with D102834 fixed, I expect this loop to be SLP vectorised and most of these pass ordering problems to disappear.

Philip seems happy with this being part of GVN, and I don’t have strong opinions, but GVNHoist seems like the natural place for this. An alternative strategy could thus be to integrate this into GVNHoist and enable it by default, but with only your simple GVN hoist algorithm enabled (and the rest disabled). That would perhaps then be a first step to address Philip’s unhappiness with GVNHoist and to restructure and improve it step by step.

The SLP vectoriser will fail to vectorise this because it wasn't taught to emit runtime alias analysis checks.

It likely needs more than that - manually hoisting all the loads and
adding `restrict` still does not vectorise at `-O3`:
https://gcc.godbolt.org/z/z3KjaYv9d

Philip seems happy with this being part of GVN, and I don't have strong opinions, but GVNHoist seems like the natural place for this. An alternative strategy could thus be to integrate this into GVNHoist and enable it by default, but with only your simple GVN hoist algorithm enabled (and the rest disabled). That would perhaps then be a first step to address Philip's unhappiness with GVNHoist and to restructure and improve it step by step.

I'm afraid I have to disagree - GVNHoist already does all the hoisting
we need; in that sense the mini-GVNHoist is not something to
integrate, but a simpler alternative.

If we put the issue of dealing with masked loads and stores in
vectorisers aside, I see these approaches:
* improve on the mini-GVNHoist as part of GVN (current proposal)
* bring GVNHoist up to snuff
* create a NewGVNHoist on top of NewGVN

The SLP vectoriser will fail to vectorise this because it wasn’t taught to emit runtime alias analysis checks.

It likely needs more than that - manually hoisting all the loads and
adding restrict still does not vectorise at -O3:
https://gcc.godbolt.org/z/z3KjaYv9d

Hmmm, but it does at -O2, and it doesn’t look too bad at a first quick look. I don’t know what that tells us though (I haven’t looked into the details), but at -O2 it does look like the SLP vectoriser is kicking in?

I’m afraid I have to disagree - GVNHoist already does all the hoisting
we need; in that sense the mini-GVNHoist is not something to
integrate, but a simpler alternative.

The way I look at this is that we could have different modes for GVNHoist. Your simple GVN hoist could be the normal/cheap mode: simpler and more compile-time friendly, slightly less optimal results. In aggressive mode (the current stuff in GVNHoist), it could spend more compile time on more optimal results.

If we put the issue of dealing with masked loads and stores in
vectorisers aside, I see these approaches:

  • improve on the mini-GVNHoist as part of GVN (current proposal)
  • bring GVNHoist up to snuff
  • create a NewGVNHoist on top of NewGVN

I am not a big fan of all these pass names like SimplePassName, NewPassName, NewerPassName, MiniPassName, etc., which is another reason why I was trying to see if we could integrate this in GVNHoist somehow (and thus, for me, this would be the first step to “bring GVNHoist up to snuff”). But if people are happy with this as part of GVN, I am not going to object, of course.

Having said that, *conceptually* this pass takes the proper approach by using MemorySSA. Your current implementation has a bunch of complexity to basically do a local renumbering of memory instructions. Some of that can be improved (see below), but if you were doing this with MemorySSA, you wouldn't need to do anything fancy as GVN would have numbered the loads correctly.

Do you mean that NewGVN will number loads and stores correctly? The old
GVN will have unique numbers for loads and stores.

Phabricator review is at https://reviews.llvm.org/D109760.

The current structure of algorithm you proposed can probably be simplified. Here's what I'd suggest:

Start by handling all of the non-memory instructions. This should be as simple as constructing a hash map from value number to instruction for one block, and then scanning the other block. If you find the same value number in both blocks, the (by assumption scalar) instruction must exist in both blocks and can be hoisted. This can be reviewed, tested, submitted, etc..

Then add a single pre-pass which "renumbers" all of the memory instructions. You can either use a trivial "no clobbers seen" like EarlyCSE does, or only handle loads where MD->getDependency() is nonlocal. (We already do this query in processLoad, so it should be cached.) The same algorithm can be used as above, all that changes is where the valuenumber comes from. (The "pre-pass" can probably be implemented as a helper function which renumbers memory instructions on demand.)
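
For illustration, the EarlyCSE-style variant of that pre-pass could look
roughly like this (a sketch only; `nextVN`/`assignVN` are hypothetical
stand-ins for GVN's numbering, and a real version would need to seed the
generation consistently for the two sibling blocks):

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    uint32_t nextVN();                      // hypothetical: fresh number
    void assignVN(Instruction *, uint32_t); // hypothetical: record number

    // Sketch: loads of the same pointer seen under the same "memory
    // generation" (i.e. with no intervening clobber) get the same number.
    void renumberLoads(BasicBlock &BB) {
      unsigned MemGen = 0;
      DenseMap<std::pair<Value *, unsigned>, uint32_t> LoadVN;
      for (Instruction &I : BB) {
        auto *LI = dyn_cast<LoadInst>(&I);
        if (LI && !LI->isVolatile()) {
          auto Key = std::make_pair(LI->getPointerOperand(), MemGen);
          auto It = LoadVN.try_emplace(Key, nextVN()).first;
          assignVN(LI, It->second);
        } else if (I.mayWriteToMemory()) {
          ++MemGen; // clobber: start a new generation
        }
      }
    }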

The issue here is that uniqueness of load numbers propagates to the
users of those loads, so just after we've hoisted and merged a pair of
loads into one instruction we can potentially get the same VN in that
load's users.
I was handling that with a followup iteration, but indeed we could
recursively traverse the users, starting from the load, and recompute
VNs.

I would specifically *not* add an extra layer of iteration here. If you find you need iteration to handle the cases you care about, I suspect we should actually be adjusting the value re-numbering scheme for loads described just above.

A couple of assorted notes:

MemDep should be responsible for handling all aliasing barriers. If it returns a non-local dependence, reordering that instruction to the beginning of the block should be legal w.r.t. memory. Your use of isHoistBarrier is confusing to me. You may be trying to handle speculation safety (i.e. avoid introducing faults), but if so, the special casing around volatile and atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I suspect there's a particular case you're trying to handle. I would strongly advise dropping that from the initial patch, then adding back the complexity in a separate change which is well motivated with tests, etc.

With regard to volatile and the atomics, I guess I was being too
conservative in trying to not reorder *any* instructions above them. I
can remove those conditions. The other condition is there to not
speculatively execute instructions by moving them across an
instruction that may throw, e.g. with

    VN 9: br i1 %tobool, label %if.then, label %if.else
    if.then:
    VN 11: call void @h()
    VN 7: %0 = sdiv i32 %a, %b
     ...
    if.else:
    VN 7: %1 = sdiv i32 %a, %b
    ...

we don't want to end up with:

    VN 7: %0 = sdiv i32 %a, %b
    br i1 %tobool, label %if.then, label %if.else
    if.then:
    VN 11: call void @h()
     ...
    if.else:
    VN 7: %1 = sdiv i32 %a, %b
    ...

but with (`h` is `readnone`):

    VN 9: br i1 %tobool, label %if.then, label %if.else
    if.then:
    VN 11: call void @h()
    VN 7: %0 = sdiv i32 %a, %b
    ...
    if.else:
    VN 11: call void @h()
    VN 7: %1 = sdiv i32 %a, %b
    ...

we could move both the call and the division, while preserving their
order, which also answers "why allow hoisting of the barrier
instruction?".

So, the subsequent iterations of the algorithm were designed to:
a) handle renumbering of loads and their users (which could possibly
be dealt with as described above);
b) prevent all reordering across volatiles and atomics (I'll drop
that and prevent reordering only when MemDep says there is a
dependency);
c) detect when hoisting an instruction turns it from a local
dependency into a non-local one, thus potentially allowing more
hoisting; I'll see about explicitly checking whether two matching
instructions have a local dependency on a couple of instructions that
are themselves paired for hoisting.
d) prevent speculation across throwing instructions; I don't have a
nice non-iterative solution for that, e.g. in the example above there
is no dependency between the call and the division.

In terms of the actual merge logic, I'd suggest implementing it as hoist-one, then CSE. That would allow you to reuse the existing patching logic and avoid reimplementing a bunch of complexity.

I'm not sure I understand this. Currently I move one instruction, then
replace all uses of the other with the first one. Do you mean that or
something else?

Thanks for the comments and the suggestions, I now have a bunch of new
ideas to try!

Thanks and best regards,
~chill

Besides alias runtime checks for SLP, there’s an additional problem: SLP doesn’t really do if-conversion at the moment for this case and instead relies on SimplifyCFG. SimplifyCFG’s if-conversion doesn’t trigger, because there are too many instructions to duplicate. If we give SimplifyCFG more leeway (-mllvm -two-entry-phi-node-folding-threshold=10), it also gets vectorized with -O3: https://gcc.godbolt.org/z/qarnMbo5M

Ideally the SLP vectorizer would be able to perform vector if conversion directly when profitable. That way, we will also vectorize in cases where LV doesn’t kick in (e.g. due to very low trip counts) or if there’s no loop to start with.

Cheers,
Florian

Your argument here actually makes me less convinced. :)

In general, we should not be duplicating transform functionality just because of pass ordering issues, or deficiencies in other passes. We should be fixing those issues, since they are likely common to other workloads as well.

If my opinion of GVNHoist was anything other than terrible, this would probably have convinced me that we should pursue the GVNHoist option.

Philip

The SLP vectoriser will fail to vectorise this because it wasn't taught to emit runtime alias analysis checks. This is being addressed by Florian in https://reviews.llvm.org/D102834. We had many issues with full unrolling removing opportunities for the loop vectoriser, and with the SLP missing some tricks. But with D102834 fixed, I expect this loop to be SLP vectorised and most of these pass ordering problems to disappear.

Philip seems happy with this being part of GVN, and I don't have strong opinions, but GVNHoist seems like the natural place for this. An alternative strategy could thus be to integrate this into GVNHoist and enable it by default, but with only your simple GVN hoist algorithm enabled (and the rest disabled). That would perhaps then be a first step to address Philip's unhappiness with GVNHoist and to restructure and improve it step by step.

+1 to this point. This would be a perfectly reasonable way to implement the MemorySSA based approach in a shippable way. I'm not opposed to the non-memoryssa variant in (old) GVN, but this would be fine too.

Philip

Having said that, *conceptually* this pass takes the proper approach by using MemorySSA. Your current implementation has a bunch of complexity to basically do a local renumbering of memory instructions. Some of that can be improved (see below), but if you were doing this with MemorySSA, you wouldn't need to do anything fancy as GVN would have numbered the loads correctly.

Do you mean that NewGVN will number loads and stores correctly? The old
GVN will have unique numbers for loads and stores.

It should. This is the whole idea behind a MemorySSA-based GVN.

Phabricator review is at https://reviews.llvm.org/D109760.

The current structure of algorithm you proposed can probably be simplified. Here's what I'd suggest:

Start by handling all of the non-memory instructions. This should be as simple as constructing a hash map from value number to instruction for one block, and then scanning the other block. If you find the same value number in both blocks, the (by assumption scalar) instruction must exist in both blocks and can be hoisted. This can be reviewed, tested, submitted, etc..

Then add a single pre-pass which "renumbers" all of the memory instructions. You can either use a trivial "no clobbers seen" like EarlyCSE does, or only handle loads where MD->getDependency() is nonlocal. (We already do this query in processLoad, so it should be cached.) The same algorithm can be used as above, all that changes is where the valuenumber comes from. (The "pre-pass" can probably be implemented as a helper function which renumbers memory instructions on demand.)

The issue here is that uniqueness of load numbers propagates to the
users of those loads, so just after we've hoisted and merged a pair of
loads into one instruction we can potentially get the same VN in that
load's users.

You're missing my point. The renumbering scheme would explicitly consider the incoming memory state to the load as an input to the hash. It would not uniquely number the loads. As a result, two loads (in either different blocks or even the same one) would be assigned the same value number if all operands (including memory state!) were the same.
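
Concretely, with MemorySSA the numbering Philip describes could be as simple
as the following sketch (`VNFromHash` is a hypothetical helper that interns a
hash into GVN's number space):

    #include "llvm/ADT/Hashing.h"
    #include "llvm/Analysis/MemorySSA.h"
    using namespace llvm;

    uint32_t VNFromHash(hash_code H); // hypothetical interning helper

    // Sketch: number a load by (pointer operand, type, incoming memory
    // state). Two loads with the same operands *and* the same defining
    // memory access get the same value number, whichever block they are in.
    uint32_t numberLoad(MemorySSA &MSSA, LoadInst *LI) {
      MemoryAccess *State = MSSA.getMemoryAccess(LI)->getDefiningAccess();
      return VNFromHash(
          hash_combine(LI->getPointerOperand(), LI->getType(), State));
    }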

I was handling that with a followup iteration, but indeed we could
recursively traverse the users, starting from the load, and recompute
VNs.

I would specifically *not* add an extra layer of iteration here. If you find you need iteration to handle the cases you care about, I suspect we should actually be adjusting the value re-numbering scheme for loads described just above.

A couple of assorted notes:

MemDep should be responsible for handling all aliasing barriers. If it returns a non-local dependence, reordering that instruction to the beginning of the block should be legal w.r.t. memory. Your use of isHoistBarrier is confusing to me. You may be trying to handle speculation safety (i.e. avoid introducing faults), but if so, the special casing around volatile and atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I suspect there's a particular case you're trying to handle. I would strongly advise dropping that from the initial patch, then adding back the complexity in a separate change which is well motivated with tests, etc.

With regard to volatile and the atomics, I guess I was being too
conservative in trying to not reorder *any* instructions above them. I
can remove those conditions. The other condition is there to not
speculatively execute instructions by moving them across an
instruction that may throw, e.g. with

     VN 9: br i1 %tobool, label %if.then, label %if.else
     if.then:
     VN 11: call void @h()
     VN 7: %0 = sdiv i32 %a, %b
      ...
     if.else:
     VN 7: %1 = sdiv i32 %a, %b
     ...

we don't want to end up with:

     VN 7: %0 = sdiv i32 %a, %b
     br i1 %tobool, label %if.then, label %if.else
     if.then:
     VN 11: call void @h()
      ...
     if.else:
     VN 7: %1 = sdiv i32 %a, %b
     ...

but with (`h` is `readnone`):

     VN 9: br i1 %tobool, label %if.then, label %if.else
     if.then:
     VN 11: call void @h()
     VN 7: %0 = sdiv i32 %a, %b
     ...
     if.else:
     VN 11: call void @h()
     VN 7: %1 = sdiv i32 %a, %b
     ...

we could move both the call and the division, while preserving their
order, which also answers "why allow hoisting of the barrier
instruction?".

So, the subsequent iterations of the algorithm were designed to:
a) handle renumbering of loads and their users (which could possibly
be dealt with as described above);
b) prevent all reordering across volatiles and atomics (I'll drop
that and prevent reordering only when MemDep says there is a
dependency);
c) detect when hoisting an instruction turns it from a local
dependency into a non-local one, thus potentially allowing more
hoisting; I'll see about explicitly checking whether two matching
instructions have a local dependency on a couple of instructions that
are themselves paired for hoisting.

The numbering scheme should handle this.

d) prevent speculation across throwing instructions; I don't have a
nice non-iterative solution for that, e.g. in the example above there
is no dependency between the call and the division.

As you walk forward, if you encounter an instruction not guaranteed to transfer execution, only hash instructions which are safe to speculate into the header. You can no longer rely on guaranteed-to-execute fault safety past that point.
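
In code, that gate might look like this (sketch; the two `is...` predicates
are the existing helpers from `llvm/Analysis/ValueTracking.h`, while
`considerForHoisting` is a hypothetical hook into the pairing logic):

    #include "llvm/Analysis/ValueTracking.h"
    #include "llvm/IR/BasicBlock.h"
    using namespace llvm;

    void considerForHoisting(Instruction *I); // hypothetical hook

    // Sketch: once an instruction that may not transfer execution to its
    // successor has been seen, only speculatable instructions may still be
    // considered for hoisting past it.
    void walkBlock(BasicBlock &BB) {
      bool GuaranteedToExecute = true;
      for (Instruction &I : BB) {
        if (GuaranteedToExecute || isSafeToSpeculativelyExecute(&I))
          considerForHoisting(&I);
        if (!isGuaranteedToTransferExecutionToSuccessor(&I))
          GuaranteedToExecute = false;
      }
    }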

In terms of the actual merge logic, I'd suggest implementing it as hoist-one, then CSE. That would allow you to reuse the existing patching logic and avoid reimplementing a bunch of complexity.

I'm not sure I understand this. Currently I move one instruction, then
replace all uses of the other with the first one. Do you mean that or
something else?

I mean specifically, you should be able to reuse the patchReplacement logic rather than reimplementing all of the flag merging logic.

Oh, I see, cheers!
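
For the record, the hoist-one-then-CSE shape being discussed is roughly the
following (a sketch; `andIRFlags` and `combineMetadataForCSE` are the existing
helpers the patching logic is built around):

    #include "llvm/IR/Instructions.h"
    #include "llvm/Transforms/Utils/Local.h"
    using namespace llvm;

    // Sketch: hoist one of a matched pair to the insertion point, make its
    // flags and metadata valid for both paths, then CSE the twin away.
    void hoistPair(Instruction *A, Instruction *B, Instruction *InsertPt) {
      A->moveBefore(InsertPt); // hoist one copy into the predecessor
      A->andIRFlags(B);        // keep only poison/FP flags valid for both
      combineMetadataForCSE(A, B, /*DoesKMove=*/true);
      B->replaceAllUsesWith(A); // replace and erase the other copy
      B->eraseFromParent();
    }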

Apologies for the delayed reply on this issue.

The status of NewGVN is uncertain at this point due to correctness issues with the pass.
The idea of adding the GVN Hoist functionality makes sense, but I’d like a discussion on how to plan this longer term. I added a note about this in https://reviews.llvm.org/D110822.

Alina

[back to the mailing list, for more(?) exposure]

I added a note about this in https://reviews.llvm.org/D110822.

So, stepping back, I'd like to raise the question of how this change is going to be used if implemented.
In the current default pipeline MemorySSA is not available where GVN is added. This forces the
computation of MemorySSA before GVN.

Having MemorySSA computed before GVN may be good in the LTO pipeline (GVN already preserves the
analysis), as the following passes (MemCpyOpt and DSE) require it, but in the function simplification pipeline
we'd end up with GVN computing, using and preserving two analyses (MemorySSA and MemDepAnalysis) with
overlapping goals.

The status of switching to NewGVN is uncertain at this point, but porting GVN to switch over from
MemDepAnalysis to MemorySSA may be an option, at least in the interim. This would remove the issue of
having two analyses and also make MemorySSA available here.

Is this something you'd be interested in exploring?

It is possible. We (the Arm team working on these things) would like to have a reasonable idea[1]
of a couple of things.

First, will this get us anywhere? Can anyone think of any other objections
(provided, of course, the patch is of sufficient quality) against merging this and enabling it by default, including
computing MemorySSA?
AFAIK, GVNHoist was disabled due to some benchmark regressions; unfortunately, I wasn't able to find a trace
of that decision - does anyone know anything more about it? Would anyone interested be able to apply
https://reviews.llvm.org/D111418 (5 patches) and check whether there are unacceptable regressions
that we can try to resolve?

Second, what are we getting into, if we decide to migrate GVN.cpp to MemorySSA? Would that be
a couple of weeks or a multi-year project? Has anyone already thought about approaches to doing that?

I spent maybe a total of three days looking at GVN/MemorySSA/MemoryDependenceAnalysis
and the transition does not look straightforward or obvious, more like re-implementing (parts of)
MemoryDependenceAnalysis. I mean, it looks doable and not *that* much work, but more likely than not
I'm overlooking something.

For example, looking at the dependencies of loads (if I'm not mistaken, that's the only kind relevant to GVN),
MemoryDependenceAnalysis would return other loads as "dependencies", depending on aliasing,
and also `alloca`s. This does not have a direct equivalent in MemorySSA. I'm thinking of something along
the lines of: get the clobbering def for a load, get all its uses, and from this list remove the uses post-dominated
by other uses (as long as the post-dominating one is not a MayAlias), and somehow do this
in an efficient manner - that'll hopefully give the same set of dependencies (local or not) as the ones obtained
from MemoryDependenceAnalysis; then continue from there as before.
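
A very rough sketch of that traversal, with the post-domination filtering
left out (only meant to show which MemorySSA pieces would be involved):

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/Analysis/MemorySSA.h"
    using namespace llvm;

    // Sketch: approximate MemDep's dependency list for a load by taking its
    // clobbering definition plus the other accesses hanging off that same
    // memory state. The post-domination filtering described above is elided.
    SmallVector<Instruction *, 8> loadDeps(MemorySSA &MSSA, LoadInst *LI) {
      SmallVector<Instruction *, 8> Deps;
      MemoryAccess *Clobber =
          MSSA.getWalker()->getClobberingMemoryAccess(LI);
      if (auto *Def = dyn_cast<MemoryUseOrDef>(Clobber))
        if (Instruction *DepI = Def->getMemoryInst()) // null for liveOnEntry
          Deps.push_back(DepI);
      for (User *U : Clobber->users()) // accesses sharing the same state
        if (auto *MU = dyn_cast<MemoryUseOrDef>(U))
          if (MU->getMemoryInst() != LI)
            Deps.push_back(MU->getMemoryInst());
      return Deps;
    }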

Any other ideas?

~chill

[1] by no means a guarantee; we're just looking to assess the risks of going one way or another

[back to the mailing list, for more(?) exposure]

From: Alina Sbirlea <asbirlea@google.com>

I added a note about this in https://reviews.llvm.org/D110822.

So, stepping back, I’d like to raise the question of how this change is going to be used if implemented.
In the current default pipeline MemorySSA is not available where GVN is added. This forces the
computation of MemorySSA before GVN.

Having MemorySSA computed before GVN may be good in the LTO pipeline (GVN already preserves the
analysis), as the following passes (MemCpyOpt and DSE) require it, but in the function simplification pipeline
we’d end up with GVN computing, using and preserving two analyses (MemorySSA and MemDepAnalysis) with
overlapping goals.

The status of switching to NewGVN is uncertain at this point, but porting GVN to switch over from
MemDepAnalysis to MemorySSA may be an option, at least in the interim. This would remove the issue of
having two analyses and also make MemorySSA available here.

Is this something you’d be interested in exploring?

It is possible. We (the Arm team working on these things) would like to have a reasonable idea[1]
of a couple of things.

First, will this get us anywhere? Can anyone think of any other objections
(provided, of course, the patch is of sufficient quality) against merging this and enabling it by default, including
computing MemorySSA?

My team’s focus is on improving the infrastructure in LLVM without negatively impacting performance. One of the top priorities at the moment is removing the usages of MemDepAnalysis - GVN is the last of these. I’ve been looking into fixing NewGVN in order to get it to a stage of being viable to replace the current GVN. At this point, there are still correctness issues to resolve, and even with those addressed it will not reach parity with GVN due to NewGVN not doing PRE.

Given all this and your involvement in adding GVNHoist to the current GVN, it seems worth considering whether investment into the current GVN makes more sense at this time.
It’s still possible that NewGVN would give better optimizations and/or compile times, but if GVN were to only rely on MemorySSA for its decisions, then we’d be enabling that by default first. Computing MemorySSA will not be an issue for adding GVN Hoist then, since it would always be available.
Again, my only objection to adding GVN Hoist would be about striking a reasonable balance between the gains from the new optimization and the compile time incurred.

AFAIK, GVNHoist was disabled due to some benchmark regressions; unfortunately, I wasn’t able to find a trace
of that decision - does anyone know anything more about it? Would anyone interested be able to apply
https://reviews.llvm.org/D111418 (5 patches) and check whether there are unacceptable regressions
that we can try to resolve?

Could you run these through the compile-time tracker as an initial testing stage? (+nikic for support here)
(I expect a higher increase in compile time due to computing MemorySSA, which would be mitigated if GVN is first switched to only use MemorySSA.)

Second, what are we getting into, if we decide to migrate GVN.cpp to MemorySSA? Would that be
a couple of weeks or a multi-year project? Has anyone already thought about approaches to doing that?

I’d estimate a few months of on-and-off work (or less than a couple of months of more dedicated work). The DSE transition was a similar process; looking at the patch history, from the first patch variant to the flag being flipped took ~7 months (+fhahn for additional feedback here).

I spent maybe a total of three days looking at GVN/MemorySSA/MemoryDependenceAnalysis
and the transition does not look straightforward or obvious, more like re-implementing (parts of)
MemoryDependenceAnalysis. I mean, it looks doable and not that much work, but more likely than not
I’m overlooking something.

For example, looking at the dependencies of loads (if I’m not mistaken, that’s the only kind relevant to GVN),
MemoryDependenceAnalysis would return other loads as “dependencies”, depending on aliasing,
and also allocas. This does not have a direct equivalent in MemorySSA. I’m thinking of something along
the lines of: get the clobbering def for a load, get all its uses, and from this list remove the uses post-dominated
by other uses (as long as the post-dominating one is not a MayAlias), and somehow do this
in an efficient manner - that’ll hopefully give the same set of dependencies (local or not) as the ones obtained
from MemoryDependenceAnalysis; then continue from there as before.

That’s the general idea, yes. Similar traversals were added in DSE, and it may make sense to reuse (potentially factor out) some of those.

I don’t think you’re overlooking anything. We just didn’t focus on this until now because the general agreement has been we’d move over to NewGVN instead (vs for DSE there wasn’t another pass using MemorySSA and doing the same kind of transformation).

Any other ideas?

~chill

[1] by no means a guarantee; we’re just looking to assess the risks of going one way or another

Understood and it makes sense! Hope I’ve clarified some of your questions.

Best,
Alina