Reordering of load/stores using MemorySSA

Hello all,

It seems that relying on MemorySSA alone is insufficient to
correctly reorder loads and stores.

For example, consider the following code:

v = load q
store 10, p

=>

store 10, p
v = load q // This is incorrect if p and q may alias

In the MemorySSA representation, however, there is no syntactic
dependency between the load and the store, because before the
transformation the MemorySSA would look like this:

USE(M0) // corresponding to v = load q
M1 = DEF(M0) // corresponding to store 10, p

Reordering these two nodes does not break any syntactic dependency,
hence an additional flow-sensitive analysis is needed to check whether the reordering is valid.
To my understanding, GVNHoist does this:
memory operations in basic blocks are checked by
GVNHoist::hasMemoryUse().
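As a toy illustration (a simplified model, not GVNHoist's actual code), the flow-sensitive check amounts to scanning the memory operations a store would be hoisted above and querying alias analysis for each:

```python
# Toy model of the flow-sensitive legality check (a sketch, not
# GVNHoist's real implementation; `may_alias` stands in for an
# alias-analysis query).

def may_alias(loc_a, loc_b):
    # Simplistic stand-in: two locations alias iff they name the
    # same base pointer.
    return loc_a == loc_b

def safe_to_hoist_store(store_loc, preceding_ops):
    # preceding_ops: ("load" | "store", location) pairs the store
    # would be moved above. Hoisting is unsafe if any of them may
    # touch the same memory as the store.
    return all(not may_alias(store_loc, loc) for _kind, loc in preceding_ops)
```

With this model, hoisting `store 10, p` above `v = load q` is allowed only when `may_alias(p, q)` is false.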

If the USE nodes of MemorySSA also had a left-hand side, we could
correctly represent this kind of reorderability, IMHO.

For the example above:

U1 = USE(M0) // corresponding to v = load q
M1 = DEF(M0, U1) // corresponding to store 10, p

Now there is a syntactic dependency between M1 and U1, hence
we can syntactically check whether the reordering is allowed without any flow-sensitive analysis.
If the store and the load don't alias, we can remove U1 from the DEF:

U1 = USE(M0) // corresponding to v = load q
M1 = DEF(M0) // corresponding to store 10, p

Then, the reordering is valid.
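To make the proposal concrete, here is a small Python model (an illustration of the idea only; none of these classes exist in LLVM): USEs become named nodes, DEFs carry the USEs that must stay above them, and the legality check is a purely syntactic operand lookup.

```python
# Toy model of the proposed extension: USE nodes get names, and a
# DEF lists the USEs that must stay above it. (Illustrative only,
# not LLVM's actual MemorySSA classes.)

class MemUse:
    def __init__(self, name, reaching_def):          # e.g. U1 = USE(M0)
        self.name = name
        self.reaching_def = reaching_def

class MemDef:
    def __init__(self, name, reaching_def, uses=()):  # e.g. M1 = DEF(M0, U1)
        self.name = name
        self.reaching_def = reaching_def
        self.uses = list(uses)

def can_hoist_def_above_use(d, u):
    # The reordering is syntactically legal exactly when the DEF
    # does not list the USE as an operand.
    return u not in d.uses
```

For the example above, M1 = DEF(M0, U1) forbids moving the store above the load; after alias analysis proves no-alias and U1 is dropped from the operand list, the same check permits it.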

One issue concerns PHI nodes: a possible solution is to have
two kinds of PHIs - PHIs for DEFs and PHIs for USEs - so there are
at most two PHIs per block.

Does this idea seem to make sense? Any thoughts?

Best Regards,
Juneyoung Lee

> Hello all,
>
> It seems that relying on MemorySSA alone is insufficient to
> correctly reorder loads and stores.

It depends on what you are trying to do.

> For example, consider the following code:
>
> v = load q
> store 10, p
>
> =>
>
> store 10, p
> v = load q // This is incorrect if p and q may alias
>
> In the MemorySSA representation, however, there is no syntactic
> dependency between the load and the store, because before the
> transformation the MemorySSA would look like this:
>
> USE(M0) // corresponding to `v = load q`
> M1 = DEF(M0) // corresponding to `store 10, p`

This is correct, MemorySSA, like SSA, represents upwards dependencies
and not downwards ones.

Much like there is SSU and SSI, and SSA/SSU/SSI on the reverse flow
graph, there are analogues for MemorySSA that we do not implement.

> Reordering these two nodes does not break any syntactic dependency,

This is true in regular SSA as well :)
It just happens that memory is different due to aliasing.

> hence an additional flow-sensitive analysis is needed to check whether the reordering is valid.

> To my understanding, GVNHoist does this:
> memory operations in basic blocks are checked by
> GVNHoist::hasMemoryUse().

Correct.

> If the USE nodes of MemorySSA also had a left-hand side, we could
> correctly represent this kind of reorderability, IMHO.
>
> For the example above:
>
> U1 = USE(M0) // corresponding to `v = load q`
> M1 = DEF(M0, U1) // corresponding to `store 10, p`

It is not this simple.
You also need to avoid multiple reaching uses for a single def.
I.e.:
U0 = USE(M0) // corresponding to `v = load q`
U1 = USE(M0) // corresponding to `t = load r`, where r aliases q aliases p
M1 = DEF(M0, ???)

??? must be both U0 and U1 above.

If you have the analogue of phis, you will require that analogue to
merge the multiple reaching uses, and you may have multiple merges per
block.
I.e.:
U2 = MERGE(U0, U1)
M1 = DEF(M0, U2)
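A straight-line sketch of how such a form could be built (a toy renamer that conservatively assumes every operation aliases every other; this is my own illustration, not an existing LLVM pass):

```python
# Toy builder for the extended form over a single basic block,
# assuming all operations alias. Every USE since the last DEF
# becomes an operand of the next DEF, which is exactly where
# multiple reaching uses pile up.

def build_extended_memssa(ops):
    # ops: a sequence of "load" / "store" markers in program order.
    nodes = []
    pending_uses = []          # USEs not yet consumed by a DEF
    last_def = "M0"            # live-on-entry state
    use_n = def_n = 0
    for kind in ops:
        if kind == "load":
            name = f"U{use_n}"; use_n += 1
            nodes.append(f"{name} = USE({last_def})")
            pending_uses.append(name)
        else:  # store
            def_n += 1
            name = f"M{def_n}"
            operands = ", ".join([last_def] + pending_uses)
            nodes.append(f"{name} = DEF({operands})")
            last_def, pending_uses = name, []
    return nodes
```

At control-flow joins, the per-predecessor pending-use sets would additionally have to be combined by the use-merging node described above.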

> Now there is a syntactic dependency between M1 and U1, hence
> we can syntactically check whether reordering is allowed without any flow-sensitive analysis.

What's the goal?
Much like SSA is meant to speed up certain dataflow problems, so is
this. Something would have to be slow to make this worth it.

MemorySSA in particular is deliberately an overapproximation in LLVM
(and GCC), because we discovered that exactness is super-expensive,
sometimes impossible, and that going as far as you can buys you almost
nothing in practice.
However, it makes for a much faster MemoryDependenceAnalysis (which in
LLVM is also only upwards and not downwards).

Here, it's not clear that GVNHoist is slow enough to make building an
entire ReverseMemorySSA form worth it.
In particular, building reverse MemorySSA will double the number of
alias checks in most cases.
You would almost certainly have to fix the BasicAA caching issues and
finally split BasicAA or something to make it viable.
(MemorySSA already pays a high price for BasicAA being super slow).

> If the store and the load don't alias, we can remove U1 from the DEF:
>
> U1 = USE(M0) // corresponding to `v = load q`
> M1 = DEF(M0) // corresponding to `store 10, p`
>
> Then, the reordering is valid.
>
> One issue concerns PHI nodes: a possible solution is to have
> two kinds of PHIs - PHIs for DEFs and PHIs for USEs - so there are
> at most two PHIs per block.
>
> Does this idea seem to make sense? Any thoughts?

It's certainly correct; I have considered doing it before but never
had anything to test whether it was really worth the cost.

One previous blocker was that it requires a correct postdomtree to
place the USE PHIs.
That should now be solved.

I suspect you will find the cost is not worth it without a meaningful
motivating optimization (like store PRE or something).

Hello Daniel,

Thank you for the reply.
My impression was that embedding such information into MemorySSA
would help shorten compilation time,
and would also help people who need this kind of information to write optimizations,
but the build time of MemorySSA itself should also be considered.

> I suspect you will find the cost is not worth it without a meaningful
> motivating optimization (like store PRE or something)

I think implementing store PRE for very simple cases sounds interesting
(not only for MemorySSA, but for the optimization itself as well).
I can try implementing it if it is straightforward.
Which file should I start from?

By the way, while looking through memory-related optimizations, I found this patch:
https://reviews.llvm.org/D5422 .
Would it make sense for store PRE to handle atomic operations
as well? I wonder whether people are interested in
optimizing atomic operations, and whether there are more patches like this.

Thank you,
Juneyoung Lee

> Hello Daniel,
>
> Thank you for the reply.
> My impression was that embedding such information into MemorySSA
> would help shorten compilation time,
> and would also help people who need this kind of information to write optimizations,
> but the build time of MemorySSA itself should also be considered.

Right now, building MemorySSA in LLVM has a high cost because
AA in LLVM is so slow.
(It has no meaningful cost when built in GCC.)

It would be ... a significant undertaking to speed up AA in LLVM.

(The underlying problem is that BasicAA is completely stateless but
does a tremendous amount of work.
Thus it walks over all the IR but has no caching.)
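The caching issue can be illustrated with a small wrapper (a generic memoization sketch, not BasicAA's actual interface): the expensive stateless query runs once per pointer pair instead of once per call.

```python
# Sketch of a caching layer over a stateless alias query. This is a
# generic memoization idea, not BasicAA's real interface.

class CachingAA:
    def __init__(self, base_query):
        self.base_query = base_query  # the expensive, stateless check
        self.cache = {}
        self.base_calls = 0           # instrumentation, for demonstration

    def alias(self, p, q):
        key = (p, q) if p <= q else (q, p)  # alias queries are symmetric
        if key not in self.cache:
            self.base_calls += 1
            self.cache[key] = self.base_query(p, q)
        return self.cache[key]
```

Repeated queries on the same pair (in either order) then hit the cache rather than re-walking the IR.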

> I suspect you will find the cost is not worth it without a meaningful
> motivating optimization (like store PRE or something)

> I think implementing store PRE for very simple cases sounds interesting
> (not only for MemorySSA, but for the optimization itself as well).
> I can try implementing it if it is straightforward.
> Which file should I start from?

There have been PDSE/etc patches posted to llvm-dev before that are
probably a good start.

> There have been PDSE/etc patches posted to llvm-dev before that are
> probably a good start.

Thanks for the info! :)

Juneyoung Lee