Question about shouldMergeGEPs in InstructionCombining

Hello

I am not sure I understand the logic for merging GEPs in InstructionCombining.cpp:

static bool shouldMergeGEPs(GEPOperator &GEP, GEPOperator &Src) {
  // If this GEP has only 0 indices, it is the same pointer as
  // Src. If Src is not a trivial GEP too, don't combine
  // the indices.
  if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() &&
      !Src.hasOneUse())
    return false;
  return true;
}

I have a case where
GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6
Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096, i32 0, i32 %shl2

GEP.hasAllZeroIndices() will return false and the merge will occur.
Why do we want to combine these 2 getelementptrs?

On my out-of-tree target, combining these 2 GetElementPtrs creates a performance regression: since GEP is in a loop (Src is out of the loop), GEP will lower to a more complicated address for a subsequent load, and that complicated address has to be recalculated over and over in the loop.
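To make the regression concrete, here is an illustrative C++ sketch (not the actual lowering; the function names are invented) contrasting the hand-hoisted form with what the merged GEP effectively computes, where the loop-invariant add is redone on every iteration:

```cpp
#include <cassert>
#include <vector>

// Hand-hoisted form: the invariant part of the address (data + offset) is
// computed once, outside the loop; each iteration only adds the induction
// variable.
std::vector<int> hoistedForm(const std::vector<int> &data, int offset) {
  std::vector<int> out;
  const int *base = data.data() + offset; // loop-invariant, computed once
  for (int i = 0; i < 8; ++i)
    out.push_back(base[i]);
  return out;
}

// Merged form: the invariant 'offset' is re-added to the index on every
// iteration, mirroring the single GEP produced by the merge when the target
// cannot fold the extra add into its addressing mode.
std::vector<int> mergedForm(const std::vector<int> &data, int offset) {
  std::vector<int> out;
  for (int i = 0; i < 8; ++i)
    out.push_back(data[i + offset]); // add recomputed each trip
  return out;
}
```

Both forms compute the same values; the difference is only where the invariant add lives.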

Thanks.

From: "Francois Pichet" <pichet2000@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Sunday, February 22, 2015 5:34:11 PM
Subject: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining


There are a couple of issues here. One, InstCombine's job is to move the IR toward a reasonable canonical form. It is doing that here, and I think that's the right thing to do. However, the problem you point out is a general one. Can you please explain why the MachineLICM pass is not able to hoist the redundant parts of the addressing computation out of the loop? We might also want to adjust for this in CodeGenPrep (CGP already has addressing-mode-aware GEP splitting logic, although not quite for this case).

-Hal

Hi Hal,

MachineLICM is not able to hoist anything because the address mode is not
loop invariant.

Here is a reduction of the code I am talking about.

extern const unsigned phasor[4096];
void test(unsigned *out, unsigned step_size)
{
  unsigned big_step_size = step_size << 2;
  const unsigned *phasor_ptr_temp_1 = &phasor[big_step_size];
  for (int i = 0; i < 1020; i += 4)
    out[i] = phasor_ptr_temp_1[i << step_size];
}

I am getting slightly better code on my target (Octasic's Opus) if I return false from shouldMergeGEPs. I just tried ppc64 and x86-64, and I am also getting better code without the GEP merging in InstructionCombining. I am not sure what the solution is yet, but I think we are too aggressive when merging GEPs in InstructionCombining.

Here are the details for ppc64: http://pastie.org/9978022

----- Original Message -----

From: "Francois Pichet" <pichet2000@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, February 24, 2015 6:17:22 AM
Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining


First, thanks for posting this, it is quite useful. LLVM trunk's PowerPC backend does a slightly better job now, but for an irrelevant reason, so to stick with the code you generated, we have:

.LBB0_1: # %for.body
                                        # =>This Inner Loop Header: Depth=1
  slw 8, 5, 4
  ld 9, .LC1@toc@l(7)
  addi 5, 5, 4
  add 8, 8, 6
  extsw 8, 8
  sldi 8, 8, 2
  lwzx 8, 9, 8
  addi 9, 3, 16
  stw 8, 0(3)
  mr 3, 9
  bdnz .LBB0_1

There are two things wrong here. First:

  ld 9, .LC1@toc@l(7)

this load is loop invariant and should be hoisted (but was not). But more to your point, this:

  add 8, 8, 6

which represents this in your C code:

  int *phasor_ptr_temp_1 = &phasor[big_step_size];

has remained inside the loop. As you point out, the programmer even hoisted it for us, and we sunk it into the loop ourselves. The question is really this: How should we fix this? Generically, I believe that using fewer GEPs makes a cleaner canonical form for the IR as it passes through the mid-level optimizers, and we should split the GEPs to hoist invariant parts out of loops later in the pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do addressing-mode-sensitive transformations.

-Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Francois Pichet" <pichet2000@gmail.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>, "chandlerc" <chandlerc@gmail.com>
Sent: Tuesday, February 24, 2015 11:27:43 PM
Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining


The problem of the non-hoisted TOC load was a backend deficiency, now fixed in r230553. Thanks for providing this example. I'll now give some more thought to the GEP splitting issue.

-Hal

Coincidentally, I just ran into this same issue on some of our benchmarks for the NVPTX backend. You have something like this before instcombine:

  %tmp = getelementptr inbounds i32, i32* %input, i64 %offset

loop:
  %loop_variant = ...
  %ptr = getelementptr inbounds i32, i32* %tmp, i64 %loop_variant

which gets transformed to:

loop:
  %loop_variant = ...
  %sum = add nsw i64 %loop_variant, %offset
  %ptr = getelementptr inbounds i32, i32* %input, i64 %sum

The merge essentially reassociates the loop-variant term (%loop_variant) and the loop-invariant terms (%input and %offset) in such a way that LICM can't remove the computation from the loop.

One idea is to only perform this style of GEP merge if at least one of the following conditions is true:
(1) both index terms in the GEP are constant. In this case no new add instruction is created; instead the constants are folded.
(2) the GEPs are in the same BB.
(3) LoopInfo is available, and we know we're not creating a new instruction in a (deeper) loop.
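A minimal sketch of such a guard (hypothetical helper; the booleans stand in for the real queries against GEPOperator, basic blocks, and LoopInfo):

```cpp
#include <cassert>

// Hypothetical predicate mirroring conditions (1)-(3) above; in real code the
// arguments would come from hasAllConstantIndices(), getParent() comparisons,
// and LoopInfo::getLoopDepth() respectively.
bool shouldMergeGEPsSketch(bool bothIndexSetsConstant, // (1)
                           bool sameBasicBlock,        // (2)
                           bool noNewInstInDeeperLoop) // (3)
{
  return bothIndexSetsConstant || sameBasicBlock || noNewInstInDeeperLoop;
}
```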

What do you think?

Mark

I think it would make sense for (1) and (2). I am not sure if (3) is feasible in instcombine (I am not too familiar with LoopInfo).

For Octasic's Opus platform, I modified shouldMergeGEPs in our fork to:

if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() &&
    !Src.hasOneUse())
  return false;

return Src.hasAllConstantIndices(); // was return true;

Following that change, I noticed some performance gains for a few specific tests and no regressions at all in our (admittedly limited) benchmark suite.

Regards,
Francois Pichet, Octasic.


LoopInfo should be available for at least one of the instcombine invocations. In the -O2 pipeline it looks like instcombine is called 7 times, and it should have LoopInfo for the third invocation (called immediately after some loop passes). Here's the output from --debug-pass=Structure:

Pass Arguments: -tti -no-aa -tbaa -scoped-noalias
-assumption-cache-tracker -targetlibinfo -basicaa -verify -simplifycfg
-domtree -sroa -early-cse -lower-expect
Target Transform Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
Scoped NoAlias Alias Analysis
Assumption Cache Tracker
Target Library Information
Basic Alias Analysis (stateless AA impl)
  FunctionPass Manager
    Module Verifier
    Simplify the CFG
    Dominator Tree Construction
    SROA
    Early CSE
    Lower 'expect' Intrinsics
Pass Arguments: -targetlibinfo -tti -no-aa -tbaa -scoped-noalias
-assumption-cache-tracker -basicaa -verify-di -ipsccp -globalopt
-deadargelim -domtree -instcombine -simplifycfg -basiccg -prune-eh
-inline-cost -inline -functionattrs -sroa -domtree -early-cse
-lazy-value-info -jump-threading -correlated-propagation -simplifycfg
-domtree -instcombine -tailcallelim -simplifycfg -reassociate -domtree
-loops -loop-simplify -lcssa -loop-rotate -licm -loop-unswitch
-instcombine -scalar-evolution -loop-simplify -lcssa -indvars
-loop-idiom -loop-deletion -loop-unroll -memdep -mldst-motion -domtree
-memdep -gvn -memdep -memcpyopt -sccp -domtree -bdce -instcombine
-lazy-value-info -jump-threading -correlated-propagation -domtree
-memdep -dse -loops -loop-simplify -lcssa -licm -adce -simplifycfg
-domtree -instcombine -barrier -domtree -loops -loop-simplify -lcssa
-loop-rotate -branch-prob -block-freq -scalar-evolution -loop-accesses
-loop-vectorize -instcombine -scalar-evolution -slp-vectorizer
-simplifycfg -domtree -instcombine -loops -loop-simplify -lcssa
-scalar-evolution -loop-unroll -alignment-from-assumptions
-strip-dead-prototypes -globaldce -constmerge -verify -verify-di
Target Library Information
Target Transform Information
No Alias Analysis (always returns 'may' alias)
Type-Based Alias Analysis
Scoped NoAlias Alias Analysis
Assumption Cache Tracker
Basic Alias Analysis (stateless AA impl)
  ModulePass Manager
    Debug Info Verifier
    Interprocedural Sparse Conditional Constant Propagation
    Global Variable Optimizer
    Dead Argument Elimination
    FunctionPass Manager
      Dominator Tree Construction
      *Combine redundant instructions*
      Simplify the CFG
    CallGraph Construction
    Call Graph SCC Pass Manager
      Remove unused exception handling info
      Inline Cost Analysis
      Function Integration/Inlining
      Deduce function attributes
      FunctionPass Manager
        SROA
        Dominator Tree Construction
        Early CSE
        Lazy Value Information Analysis
        Jump Threading
        Value Propagation
        Simplify the CFG
        Dominator Tree Construction
        *Combine redundant instructions*
        Tail Call Elimination
        Simplify the CFG
        Reassociate expressions
        Dominator Tree Construction
        Natural Loop Information
        Canonicalize natural loops
        Loop-Closed SSA Form Pass
        Loop Pass Manager
          Rotate Loops
          Loop Invariant Code Motion
          Unswitch loops
        *Combine redundant instructions*
        Scalar Evolution Analysis
        Canonicalize natural loops
        Loop-Closed SSA Form Pass
        Loop Pass Manager
          Induction Variable Simplification
          Recognize loop idioms
          Delete dead loops
          Unroll loops
        Memory Dependence Analysis
        MergedLoadStoreMotion
        Dominator Tree Construction
        Memory Dependence Analysis
        Global Value Numbering
        Memory Dependence Analysis
        MemCpy Optimization
        Sparse Conditional Constant Propagation
        Dominator Tree Construction
        Bit-Tracking Dead Code Elimination
        *Combine redundant instructions*
        Lazy Value Information Analysis
        Jump Threading
        Value Propagation
        Dominator Tree Construction
        Memory Dependence Analysis
        Dead Store Elimination
        Natural Loop Information
        Canonicalize natural loops
        Loop-Closed SSA Form Pass
        Loop Pass Manager
          Loop Invariant Code Motion
        Aggressive Dead Code Elimination
        Simplify the CFG
        Dominator Tree Construction
        *Combine redundant instructions*
    A No-Op Barrier Pass
    FunctionPass Manager
      Dominator Tree Construction
      Natural Loop Information
      Canonicalize natural loops
      Loop-Closed SSA Form Pass
      Loop Pass Manager
        Rotate Loops
      Branch Probability Analysis
      Block Frequency Analysis
      Scalar Evolution Analysis
      Loop Access Analysis
      Loop Vectorization
      *Combine redundant instructions*
      Scalar Evolution Analysis
      SLP Vectorizer
      Simplify the CFG
      Dominator Tree Construction
      *Combine redundant instructions*
      Natural Loop Information
      Canonicalize natural loops
      Loop-Closed SSA Form Pass
      Scalar Evolution Analysis
      Loop Pass Manager
        Unroll loops
      Alignment from assumptions
    Strip Unused Function Prototypes
    Dead Global Elimination
    Merge Duplicate Global Constants
    FunctionPass Manager
      Module Verifier
    Debug Info Verifier
    Bitcode Writer

Mark


Hi Mark,

It is not clear to me at all that preventing the merging is the right solution. There are a large number of analyses, including alias analysis, and optimizations that use GetUnderlyingObject and related routines to search back through GEPs. They only do this up to some small finite depth (six, IIRC). So reducing the GEP depth is likely the right solution for InstCombine (which has the job of canonicalizing the IR).

We should, however, pull these apart somewhere, and probably in some way that is address-mode aware. I'd recommend trying to split non-free (via the addressing-mode) loop-invariant parts of GEPs out of loops in CodeGenPrep.

-Hal


Thanks, Hal. I'll have a look at CGP to see how this might be done. It's a little more complicated than just pulling the GEP apart: there needs to be a loop-invariant-aware reassociation to undo the damage done by the initial merge.

Mark

So this is, in fact, just using the ranks reassociation would normally use to do the splitting. :)

That is, even outside of GEPs, you end up with chains of operations where, if you moved the operands around, you would expose loop-invariant calculations.

Reassociation builds a rank map ordered by an RPO traversal of the CFG, and uses it to place operations at the same rank into the same expression; this guarantees deeper loops have higher ranks.
For your purposes, if you have it calculated already, you could just use loop depth instead of RPO ordering, since you only care about invariantness. (The downside to not using RPO here is that you may increase register pressure; the upside is that it's easier to reason about the ranks for your purposes.)

Anyway, Reassociate tries to place operations with the lowest ranks into leaf computations, since those are "the most loop invariant".

You want to do exactly the same thing, with likely the same algorithm, splitting GEPs as necessary into "leaf GEPs" so that low-ranking operations land in the same GEP.

The only heuristic part is "what is the lowest rank you want to split for".
If you have stuff at rank 0 and stuff not at rank 0, you always want to split out the rank-0 stuff.
Between rank 1 and rank 2: if you are using loop depth, the rank-1 part can only be invariant in a loop at depth 2, etc.
You don't need to split if rank == highest rank in the function, since in that case you are guaranteed it is not invariant in any loop.
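The ranking-and-splitting idea can be sketched in a few lines of self-contained C++ (a simplified, hypothetical model using loop depth as the rank, not the actual Reassociate code):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sketch of the rank idea using loop depth as the rank (hypothetical,
// simplified from what Reassociate does with RPO-based ranks).
struct Operand {
  int value;     // stand-in for an SSA value
  int loopDepth; // rank: 0 = defined outside all loops (most invariant)
};

// Partition operands so that everything invariant at 'curDepth' (i.e. with
// rank < curDepth) forms a "leaf" computation that can be hoisted, while the
// rest stays inside the loop.
std::pair<std::vector<Operand>, std::vector<Operand>>
splitByRank(std::vector<Operand> ops, int curDepth) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](const Operand &a, const Operand &b) {
                     return a.loopDepth < b.loopDepth;
                   });
  std::vector<Operand> leaf, rest;
  for (const Operand &op : ops)
    (op.loopDepth < curDepth ? leaf : rest).push_back(op);
  return {leaf, rest};
}
```

The "leaf" vector corresponds to the operands you would fold into a hoisted leaf GEP; the heuristic question in the text is just where to draw the `curDepth` line.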

Hi Daniel,

Thanks! I wonder if there's a way to reuse some code from Reassociate. It looks like most of the logic we want to implement is in Reassociate.cpp already, but the entire pass seems too expensive to run with CGP or after each instcombine invocation.

Jingyue

Honestly, I'm not sure it's worth it.
If you want anything, the only thing you want is probably rewriteExprTree. The ranking code is about 10 lines if you are just using loop depth, and so is the sorting code.
You have to do the splitting before you hand it to rewriteExprTree anyway.

Otherwise, Reassociate probably has too much high-level logic/optimization for what you want, and it's not entirely cleanly factored.

From: "Jingyue Wu" <jingyue@google.com>
To: "Daniel Berlin" <dberlin@dberlin.org>, "Mark Heffernan" <meheff@google.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Friday, March 13, 2015 1:31:59 PM
Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining


I don't think that the algorithm itself is too expensive for CGP (and we already do dominance-aware GEP splitting in CGP); we don't do any of this at -O0 anyway. But we need something that specifically handles GEPs. For most operations (adds, etc.), reassociating in a way likely to enable LICM is, I believe, already our canonical IR form, and so we do that as part of the normal optimization pipeline. For GEPs it is anti-canonical, and so GEPs are the special case that needs handling in CGP, where we perform these kinds of anti-canonical transformations.

-Hal

I implemented this optimization within CGP, and it helped several of our benchmarks as well as addressing Francois' original issue for PowerPC. It's pretty general and shouldn't negatively impact the ability to take advantage of free address computation in the fancy addressing modes of some processors.

However, when writing it I realized a better place to put this might be what is now called the "SeparateConstOffsetFromGEP" pass. It performs two related GEP optimizations late in compilation, very close to CGP: reassociating constants in GEP expressions and splitting GEPs. The new pass would be able to share a good chunk of existing code as well (hoisting values out of index expressions). Sound reasonable?
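As a rough illustration of the constant-reassociation half (a hypothetical C++ sketch, not the pass itself): once constant offsets are separated out, accesses that differ only by a constant can share the variable part of the address.

```cpp
#include <cassert>

// Naive form: each access computes its own full index expression.
int sumPairNaive(const int *a, long i) {
  return a[i + 7] + a[i + 8];
}

// After separating the constant offsets: both accesses share the variable
// part of the address (&a[i]); the remaining constants are the kind of
// offsets that common addressing modes (reg + imm) fold for free.
int sumPairSeparated(const int *a, long i) {
  const int *base = a + i; // shared variable part, computed once
  return base[7] + base[8];
}
```

Both functions return the same value; separating the constants exposes the common subexpression `a + i`.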

Mark

SGTM. I think SeparateConstOffsetFromGEP is a nice place for this. The part that reassociates constants seems subsumed by your new optimization, which AFAIK ranks nodes in expression trees and reorganizes them according to the ranking.

Also, this pass should probably be renamed to something like GEPReassociate after your optimization goes in.

Jingyue