LICM/store-aliasing of global loads

Our frontend can guarantee that loads from globals are rematerializable and do not alias with any stores in any function in the given module. We'd like the optimization passes (and ideally the register allocator as well) to be able to use this fact. The globals are not constant "forever" but are constant during the calling of any given function in the module.

There seem to be two major ways to expose this to the optimization passes and code gen:
- build a custom alias analysis pass that indicates that these loads never alias with any stores in a given function
- declare these globals as external constants within the module

The former should give optimizations like LICM the freedom to move these loads around, allow them to be CSE'd, etc.

The latter should technically allow the same freedom to these optimizations, but doesn't currently seem to. Furthermore, the latter should give the RA enough information to rematerialize these loads instead of spilling them if necessary.

Below is a simple example module that illustrates this. It's just a memcpy loop copying between two external arrays. With unmodified TOT, opt -basicaa -licm for example will not move the invariant loads of @b and @a (into %tmp3 and %tmp5) out of the body of the for loop.

If I apply the patch found further down, LICM moves the loads out (as expected), but of course this is a fairly specific fix.

What's the right way to handle this? Should Basic AA handle this case? Will the RA be aware that it can remat these loads or do I need to do something else to allow it to know this? Will the scheduler be aware that it can reorder them?

Obviously I can also move the loads to the entry block of the function, but that does not address the RA/scheduling issues and is difficult to do in general due to some additional semantics in our frontend.

Thanks!

Stefanus

=== Example ===
@b = external constant float*
@a = external constant float*

define void @test(i32 %count) {
entry:
        br label %forcond

forcond: ; preds = %forinc, %entry
        %i.0 = phi i32 [ 0, %entry ], [ %inc, %forinc ] ; <i32> [#uses=4]
        %cmp = icmp ult i32 %i.0, %count ; <i1> [#uses=1]
        br i1 %cmp, label %forbody, label %afterfor

forbody: ; preds = %forcond
        %tmp3 = load float** @b ; <float*> [#uses=1]
        %arrayidx = getelementptr float* %tmp3, i32 %i.0 ; <float*> [#uses=1]
        %tmp5 = load float** @a ; <float*> [#uses=1]
        %arrayidx6 = getelementptr float* %tmp5, i32 %i.0 ; <float*> [#uses=1]
        %tmp7 = load float* %arrayidx6 ; <float> [#uses=1]
        store float %tmp7, float* %arrayidx
        br label %forinc

forinc: ; preds = %forbody
        %inc = add i32 %i.0, 1 ; <i32> [#uses=1]
        br label %forcond

afterfor: ; preds = %forcond
        ret void
}

=== Patch ===
Index: lib/Transforms/Scalar/LICM.cpp

The latter should technically allow the same freedom to these
optimizations, but doesn't currently seem to. Furthermore, the latter
should give the RA enough information to rematerialize these loads
instead of spilling them if necessary.

Yes... the issue is roughly that alias analysis works between
addresses, not instructions, so it can't tell that a store can't alias
a load from a constant global.

If I apply the patch found further down, LICM moves the loads out (as
expected), but of course this is a fairly specific fix.

Yes... the attached patch looks correct.

What's the right way to handle this? Should Basic AA handle this case?
Will the RA be aware that it can remat these loads or do I need to do
something else to allow it to know this? Will the scheduler be aware
that it can reorder them?

The best way to go about this is probably case-by-case, at least at
the moment... the const-ness of a global provides information that
isn't otherwise possible to express. Remat can probably be extended
to be aware of const-ness when remat-ing loads.

This will probably be fixed properly once LLVM gets type-based alias
information.

-Eli

Our frontend can guarantee that loads from globals are
rematerializable and do not alias with any stores in any function in
the given module. We'd like the optimization passes (and ideally the
register allocator as well) to be able to use this fact. The globals
are not constant "forever" but are constant during the calling of any
given function in the module.

There seem to be two major ways to expose this to the optimization
passes and code gen:
- build a custom alias analysis pass that indicates that these loads
never alias with any stores in a given function
- declare these globals as external constants within the module

If you can convince yourself that no interprocedural optimization
will ever get in trouble, the second approach here sounds reasonable
and simpler. But if the values aren't really constant, it may be
difficult to be sure. Building a custom alias analysis is reasonable
too.

The former should give optimizations like LICM the freedom to move
these loads around, allow them to be CSE'd, etc.

The latter should technically allow the same freedom to these
optimizations, but doesn't currently seem to. Furthermore, the latter
should give the RA enough information to rematerialize these loads
instead of spilling them if necessary.

Below is a simple example module that illustrates this. It's just a
memcpy loop copying between two external arrays. With unmodified TOT,
opt -basicaa -licm for example will not move the invariant loads of @b
and @a (into %tmp3 and %tmp5) out of the body of the for loop.

Good catch!

One way to fix this would be to have AliasSetTracker pretend that
pointers to constant memory never alias anything. That's a little
sneaky though, so offhand I think an approach such as what's in
your patch is better.

If I apply the patch found further down, LICM moves the loads out (as
expected), but of course this is a fairly specific fix.

Slightly better than checking for GlobalVariabls directly
is to call the AliasAnalysis' pointsToConstantMemory method.
BasicAliasAnalysis' implementation of that does exactly the same thing,
checking for constant GlobalVariables, but it would allow alias
analyses to do more sophisticated things. Could you submit a patch
for this?

What's the right way to handle this? Should Basic AA handle this case?
Will the RA be aware that it can remat these loads or do I need to do
something else to allow it to know this? Will the scheduler be aware
that it can reorder them?

It would be nice to have an AA that's smart enough to do things
like this. However for now, having code use
AliasAnalysis::pointsToConstantMemory should cover many of the
obvious cases.

Obviously I can also move the loads to the entry block of the
function, but that does not address the RA/scheduling issues and is
difficult to do in general due to some additional semantics in our
frontend.

In the scheduling department, LLVM is not yet using any alias
information. You can experiment with the -combiner-alias-analysis and
-combiner-global-alias-analysis options, which use AliasAnalysis
queries and do a pretty good job, but aren't very efficient and not
very widely tested. Ideally we'd like to do something better here.

For register allocation, LLVM currently has some simple hooks which
individual targets use to specify which loads are rematerializable.
See isReallyTriviallyReMaterializable. Currently this code is all
target-specific and doesn't use AliasAnalysis information, but I
think it could be reasonably generalized to use the new
MachineMemOperand information to be less target-dependent and to
make at least AliasAnalysis::pointsToConstantMemory queries.

Dan

Hi,

One way to fix this would be to have AliasSetTracker pretend that
pointers to constant memory never alias anything. That's a little
sneaky though, ...

on the other hand it is simple and (presumably) effective.
Do you think it really could cause trouble?

Ciao,

Duncan.

- build a custom alias analysis pass that indicates that these loads
never alias with any stores in a given function
- declare these globals as external constants within the module

If you can convince yourself that no interprocedural optimization
will ever get in trouble, the second approach here sounds reasonable
and simpler. But if the values aren't really constant, it may be
difficult to be sure.

Yes, we definitely can be sure.

Building a custom alias analysis is reasonable too.

Right, and is helpful for other reasons.

One way to fix this would be to have AliasSetTracker pretend that
pointers to constant memory never alias anything. That's a little
sneaky though, so offhand I think an approach such as what's in
your patch is better.

That does seem a little... unethical? :slight_smile:

At the very least it seems this should still return must-alias results correctly, otherwise this could hurt optimizations (I assume).

I was going to suggest that at least getModRefInfo should handle this for stores, but it looks like it already does.

If I apply the patch found further down, LICM moves the loads out (as
expected), but of course this is a fairly specific fix.

Slightly better than checking for GlobalVariabls directly
is to call the AliasAnalysis' pointsToConstantMemory method.
BasicAliasAnalysis' implementation of that does exactly the same thing,
checking for constant GlobalVariables, but it would allow alias
analyses to do more sophisticated things. Could you submit a patch
for this?

Ah, I was going to ask if there was such a method.

I'll submit a patch once I'm done testing it.

What's the right way to handle this? Should Basic AA handle this case?
Will the RA be aware that it can remat these loads or do I need to do
something else to allow it to know this? Will the scheduler be aware
that it can reorder them?

It would be nice to have an AA that's smart enough to do things
like this. However for now, having code use
AliasAnalysis::pointsToConstantMemory should cover many of the
obvious cases.

OK, are there any other obvious places this should go? I expect optimizations that use getModRefInfo on stores rather than alias sets won't need any changes.

Obviously I can also move the loads to the entry block of the
function, but that does not address the RA/scheduling issues and is
difficult to do in general due to some additional semantics in our
frontend.

In the scheduling department, LLVM is not yet using any alias
information. You can experiment with the -combiner-alias-analysis and
-combiner-global-alias-analysis options, which use AliasAnalysis
queries and do a pretty good job, but aren't very efficient and not
very widely tested. Ideally we'd like to do something better here.

Could you expand on this a bit (or point me to past discussions/...)? This can be pretty important on architectures like Cell SPU.

For register allocation, LLVM currently has some simple hooks which
individual targets use to specify which loads are rematerializable.
See isReallyTriviallyReMaterializable. Currently this code is all
target-specific and doesn't use AliasAnalysis information, but I
think it could be reasonably generalized to use the new
MachineMemOperand information to be less target-dependent and to
make at least AliasAnalysis::pointsToConstantMemory queries.

OK, I will have a look. I assume the reference to M_REMATERIALIZABLE in the comment for it should really be TID::Rematerializable? I also noticed that the documentation for TargetInstrDesc::isRematerializable() says "This flag is deprecated, please don't use it anymore" -- could you explain what replaces it?

Stefanus

Ah, I didn't realize there was such a method; that's definitely the way to go.

-Eli

Right now, LICM is the only user of the AliasSetTracker abstraction.
For the immediate problem, the fix in LICM looks simpler :-).

For the future, it is theoretically possible to write an AliasSetTracker
client that would break if given pretended results. Unless there's a
need, I'd prefer to avoid it.

Dan

\

For register allocation, LLVM currently has some simple hooks which
individual targets use to specify which loads are rematerializable.
See isReallyTriviallyReMaterializable. Currently this code is all
target-specific and doesn't use AliasAnalysis information, but I
think it could be reasonably generalized to use the new
MachineMemOperand information to be less target-dependent and to
make at least AliasAnalysis::pointsToConstantMemory queries.

OK, I will have a look. I assume the reference to M_REMATERIALIZABLE
in the comment for it should really be TID::Rematerializable? I also
noticed that the documentation for
TargetInstrDesc::isRematerializable() says "This flag is deprecated,
please don't use it anymore" -- could you explain what replaces it?

It's not *yet* deprecated. The intent is to replace it with hasSideEffects, mayHaveSideEffects, neverHasSideEffects, and more analysis. It hasn't happened yet.

Evan

If I apply the patch found further down, LICM moves the loads out (as
expected), but of course this is a fairly specific fix.

Slightly better than checking for GlobalVariabls directly
is to call the AliasAnalysis' pointsToConstantMemory method.
BasicAliasAnalysis' implementation of that does exactly the same
thing,
checking for constant GlobalVariables, but it would allow alias
analyses to do more sophisticated things. Could you submit a patch
for this?

Ah, I was going to ask if there was such a method.

I'll submit a patch once I'm done testing it.

Thanks!

What's the right way to handle this? Should Basic AA handle this
case?
Will the RA be aware that it can remat these loads or do I need to do
something else to allow it to know this? Will the scheduler be aware
that it can reorder them?

It would be nice to have an AA that's smart enough to do things
like this. However for now, having code use
AliasAnalysis::pointsToConstantMemory should cover many of the
obvious cases.

OK, are there any other obvious places this should go? I expect
optimizations that use getModRefInfo on stores rather than alias sets
won't need any changes.

At a glance, it looks like DeadStoreElimination and
MemoryDependenceAnalysis could also use this.

Obviously I can also move the loads to the entry block of the
function, but that does not address the RA/scheduling issues and is
difficult to do in general due to some additional semantics in our
frontend.

In the scheduling department, LLVM is not yet using any alias
information. You can experiment with the -combiner-alias-analysis and
-combiner-global-alias-analysis options, which use AliasAnalysis
queries and do a pretty good job, but aren't very efficient and not
very widely tested. Ideally we'd like to do something better here.

Could you expand on this a bit (or point me to past discussions/...)?

The codegen dependency graphs and the scheduler are capable of
working with precise memory dependencies, but right now they're
being given conservative dependencies. Take a look at the
-view-sunit-dags graphs on code with lots of simple loads and
stores, and you'll often see tall chains of dashed blue lines.
That's what we'd like to avoid :-).

This can be fixed by either teaching SelectionDAGISel.cpp how
to use AliasAnalysis information to compute better dependencies
when building the graph, or by teaching DAGCombiner.cpp how to
use AliasAnalysis information to optimize the dependencies.
The options I mentioned above take the latter approach, though
they probably have room for improvement.

One other factor here is that some of the transformations that
LoopStrengthReduction and CodeGenPrepare do lower getelementptrs to
ptrtoint+arithmetic+inttoptr, and this interferes with
BasicAliasAnalysis. There's been some discussion of this on
llvmdev, but I don't have a link handy.

This can be pretty important on architectures like Cell SPU.

I agree that it's an important capability. We're gradually working
on adding the requisite pieces, and we welcome help :-).

For register allocation, LLVM currently has some simple hooks which
individual targets use to specify which loads are rematerializable.
See isReallyTriviallyReMaterializable. Currently this code is all
target-specific and doesn't use AliasAnalysis information, but I
think it could be reasonably generalized to use the new
MachineMemOperand information to be less target-dependent and to
make at least AliasAnalysis::pointsToConstantMemory queries.

OK, I will have a look. I assume the reference to M_REMATERIALIZABLE
in the comment for it should really be TID::Rematerializable? I also
noticed that the documentation for
TargetInstrDesc::isRematerializable() says "This flag is deprecated,
please don't use it anymore" -- could you explain what replaces it?

I think the idea is to move towards having more complete instruction
descriptions so that the rematerializability of an instruction can be
inferred instead of being an explicit flag, but I don't know what
stage this transition is in.

Dan

Since the second has already been examined down-thread, I thought I'd talk about the first one.

Instead of implementing a whole new alias analysis, you can build on the lib/Analysis/LibCallAliasAnalysis.cpp/LibCallSemantics.cpp stuff. This provides a slightly higher level API to define that "objects" exist (with custom predicates to determine whether a pointer provably points to the 'object') and then define their relations to standard functions that can't be marked readnone.

This is mostly useful for things like libm calls, which don't modify anything except errno, and errno has a very weird and target-specific description. If your language has similar things, it could be useful for that.

-Chris

Hello Stefanus,

Your LICM patch is fine on its own, but it turns out that since
CodeGen can't always rematerialize loads of GlobalVariable constants,
it causes some performance regressions. I've disabled this feature by
default for now. You can enable it with the
-enable-licm-constant-variables option.

I didn't want to do this :-), so I made an attempt to fix CodeGen by
adding some basic remat improvements. They help out in some cases,
however it turns out that they're not sufficient in the face of
PIC-mode complications. It looks this won't be easily fixable until
after some of the major infrastructure projects are done.

Along the way, I noticed that AliasAnalsis::pointsToConstantMemory
could be used when building SelectionDAGs, which allows us to freely
schedule loads from constant global variables. The larger problem of
using alias-analysis to improve scheduling is still open, but this
change does address the specific concern you originally asked
about :-).

Dan