w----- Original Message -----
From: "Francois Pichet" <pichet2000@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, February 24, 2015 6:17:22 AM
Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining
> From: "Francois Pichet" < pichet2000@gmail.com >
> To: "LLVM Developers Mailing List" < llvmdev@cs.uiuc.edu >
> Sent: Sunday, February 22, 2015 5:34:11 PM
> Subject: [LLVMdev] Question about shouldMergeGEPs in
> InstructionCombining
>
> Hello
>
> I am not sure I understand the logic for merging GEPs in
> InstructionCombining.cpp:
>
> static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src) {
> // If this GEP has only 0 indices, it is the same pointer as
> // Src. If Src is not a trivial GEP too, don't combine
> // the indices.
> if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() &&
> !Src.hasOneUse())
> return false;
> return true;
> }
>
>
>
> I have a case where
> GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6
> Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096,
> i32 0, i32 %shl2
>
>
> GEP. hasAllZeroIndices() will return false and the merge will occur
> Why do we want to combine these 2 getelementptr?
>
>
> On my out of tree target, combining these 2 GetElementPtr create a
> performance regression because since GEP is in a loop (Src is out
> of
> loop), GEP will lower to a more complicated address for a
> subsequent
> load. (the complicated address needs to be calculated over and over
> in the loop)
>
There are a couple of issues here. One, InstCombine's job is the move
the IR toward a reasonable canonical form. It is doing that here,
and I think that's the right thing to do. However, the problem you
point out is a general one. Can you please explain why the
MachineLICM pass is not able to hoist the redundant parts of the
addressing computation out of the loop? We might also want to adjust
for this in CodeGenPrep (CGP already has addressing-mode aware GEP
splitting logic, although not quite for this case).
Hi Hal,
MachineLICM is not able to hoist anything because the address mode is
not loop invariant.
Here is a reduction of the code I am talking about.
extern const unsigned phasor[4096];
void test(unsigned* out , unsigned step_size)
{
unsigned big_step_size = step_size<<2;
int *phasor_ptr_temp_1 = &phasor[big_step_size];
for (int i = 0 ; i < 1020 ; i+=4)
out[i] = phasor_ptr_temp_1[i<<step_size];
}
I am getting slightly better code on my target (Octasic's Opus) if I
return false for shouldMergeGEPs.
I just tried with ppc64 and x86-64 and I am also getting better code
without the GEP merging in InstructionCombining. I am not sure what
the solution is yet but I think we are too aggressive when merging
GEPs in InstructionCombining.
Here is the details for ppc64: http://pastie.org/9978022
First, thanks for posting this, it is quite useful. LLVM trunk's PowerPC backend does a slightly better job now, but for an irrelevant reason, so to stick with the code you generated, we have:
.LBB0_1: # %for.body
# =>This Inner Loop Header: Depth=1
slw 8, 5, 4
ld 9, .LC1@toc@l(7)
addi 5, 5, 4
add 8, 8, 6
extsw 8, 8
sldi 8, 8, 2
lwzx 8, 9, 8
addi 9, 3, 16
stw 8, 0(3)
mr 3, 9
bdnz .LBB0_1
there are two things wrong here, first:
ld 9, .LC1@toc@l(7)
this load is loop invariant, and should be hoisted (but was not). But more to your point, this:
add 8, 8, 6
which represents this in your C code:
int *phasor_ptr_temp_1 = &phasor[big_step_size];
has remained inside the loop. As you point out, the programmer even hoisted it for us, and we sunk it into the loop ourselves. The question is really this: How should we fix this? Generically, I believe that using fewer GEPs makes a cleaner canonical form for the IR as it passes through the mid-level optimizers, and we should split the GEPs to hoist invariant parts out of loops later in the pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do addressing-mode-sensitive transformations.
-Hal