RFC: Indexing of structs vs arrays in getelementpointer

Recently I posted a patch to migrate certain GEPs between basic blocks in cases where doing so would improve the ability of instcombine to merge into more complicated addressing mode (r209049 and r209065). After some build to failures it was rolled back. I now have a patch that no longer causes the regressions I was seeing, but it also no longer can optimize the case I was trying to optimize. As several other people have shown me optimization cases where the patch does work for them I am going to commit it again once I write some new test cases people who have access to the relevant (failing) configs can verify that it works for them, but I wanted to discuss the the case that I was trying to optimize, and how transforming the types of some some values would allow for the optimization to occur.

It is worth noting up front that I could certainly do this optimization in a peephole, but I think by looking at some broader changes it might be possible to tease out other optimization opportunities.

The problem:

On x86 when you compile:

  unsigned testu(llvm::DenseMap<unsigned, unsigned> &dm, unsigned key) {
    return dm.lookup(key);
  }

is compiled down it contains:

  leaq (%r8,%rdi,8), %rax
  movl 4(%rax), %eax

when what you really want is:

  movl 4(%r8,%rdi,8), %eax

Exactly how bad that is depends on the exact micro architecture as some machines have different cache port configs and AGU capabilities, but it isn’t good. The reason why we get into this situation is that The IR that results GEP->PHI->GEP->LOAD chain. Currently the first GEP and the GEP->LOAD are being matched separately into the leaq and movl respectively. I wrote some code to move GEPs across PHIs so that the whole GEP->GEP->LOAD could be matched, and it first glance it worked and generated the optimized movl:

  %struct2 = type { i32, i32 }
  bb1:
    %tmp10 = getelementptr inbounds %struct2* %tmp1, i64 %tmp9
    br label %bb3
                                                                                
  bb2:
    %tmp20 = getelementptr inbounds %struct2* %tmp1, i64 %tmp19
    br label %bb3
                                                                                
  bb3:
    %phi = phi %struct2* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
    %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
    %tmp25 = load i32* %tmp24, align 4

is rewritten as:

  %struct2 = type { i32, i32 }
  bb1:
    br label %bb3
                                                                                
  bb2:
    br label %bb3
                                                                                
  bb3:
    %phi = phi %struct2* [ %tmp9, %bb1 ], [ %tmp19, %bb2 ]
    %tno21 = getelementptr inbounds %struct2* %tmp1, i64 % phi
    %tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
    %tmp25 = load i32* %tmp24, align 4

Which eventually gets InstCombined into a single getelementptr and matched to the single movl.

The problem that the above transform is technically illegal because “When indexing into a (optionally packed) structure, only i32 integer constants are allowed (when using a vector of indices they must all be the same i32 integer constant).” rule <http://llvm.org/docs/LangRef.html#getelementptr-instruction>. In practice it worked in this case, I can only presume that despite the language ref claiming that GEPs into structs require constant indexes that in practice variable indexes work so long as the elements all have a common size/alignment. Having said that it blew up in other cases, and when I added a check to prevent merging indexes that were structs it caused the above case to no longer be optimizable.

What to do about it:

Well, there are a couple of pragmatic solutions that I could apply. I could handle it as a peephole on x86, or do some bit casts in the IR. Before I go down that path I wanted to ask a more interesting question: Are we missing other optimization opportunities because we are using structs for homogenous data that could be expressed in terms of arrays? If we were to write an early pass that effectively rewrote structs into arrays would we see any improvements (besides the one I mentioned in this email). There are a couple of different ways of doing it:

{i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost and good and potentially eliminates certain edge cases. How about heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16} allow for better optimizations? Or maybe it makes sense to loosen the language restrictions on GEP indexes?

Louis

Wait, I don't follow. You don't violate this rule. The first index in GEP
is *not* into the structure, it is along the implicit array of objects that
any pointer refers to. Thus the i64 operand to the first GEP selects a
different struct object in an array of them, and the first i64 operand to
the second GEP re-selects the same struct object but then uses an i32 index
into it.

Perhaps you need a better example to show the illegal transform? That would
help me understand the rest of your problem.

Hmm… you are correct, it turns out you are correct, I only have to worry if it is constant after the second arg of the GEP. That should allow my case to continue to work. I am currently building a stage2 now to see if works cleanly.

Having said that, there still exist cases where you are indexing into a homogenous struct like so. Consider the following two functions, which are the same except one is written using structs and the other is written using arrays:

%struct1 = type { i32, i32 }
%struct2 = type { %struct1, %struct1 }
; Function Attrs: ssp uwtable
define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2
bb1: ; preds = %bb5
%tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0
br label %bb3

bb2: ; preds = %.lr.ph.i.i
%tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1
br label %bb3

bb3: ; preds = %bb2, %bb1
%phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

%array1 = type [2 x i32]
%array2 = type [2 x %array1]

; Function Attrs: ssp uwtable
define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2
bb1: ; preds = %bb5
%tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0
br label %bb3

bb2: ; preds = %.lr.ph.i.i
%tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1
br label %bb3

bb3: ; preds = %bb2, %bb1
%phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

The @test1 function cannot have the optimization applied because the %tmp10 and %tmp20 GEPs vary by a field index into the struct, whereas @test2 can be optimized to:

define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2

bb1: ; preds = %bb
br label %bb3

bb2: ; preds = %bb
br label %bb3

bb3: ; preds = %bb2, %bb1
%0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ]
%tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32 %0, i32 1
%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

Louis

Hmm… you are correct, it turns out you are correct, I only have to worry
if it is constant after the second arg of the GEP. That should allow my
case to continue to work. I am currently building a stage2 now to see if
works cleanly.

Cool. Hopefully we can make incremental progress on the problem while
figuring out the right thing to do in the fully general IR case...

Having said that, there still exist cases where you are indexing into a
homogenous struct like so. Consider the following two functions, which are
the same except one is written using structs and the other is written using
arrays:

Ah, delightful. I was pretty sure there was a more complex example that
still exhibited the problem, but wanted to see it to think through things
properly.

%struct1 = type { i32, i32 }

%struct2 = type { %struct1, %struct1 }

; Function Attrs: ssp uwtable

define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {

bb:

  br i1 %tmp4, label %bb1, label %bb2

bb1: ; preds = %bb5

  %tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0

  br label %bb3

bb2: ; preds = %.lr.ph.i.i

  %tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1

  br label %bb3

bb3: ; preds = %bb2, %bb1

  %phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]

  %tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1

  %tmp25 = load i32* %tmp24, align 4

  ret i32 %tmp25

}

%array1 = type [2 x i32]

%array2 = type [2 x %array1]

; Function Attrs: ssp uwtable

define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {

bb:

  br i1 %tmp4, label %bb1, label %bb2

bb1: ; preds = %bb5

  %tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0

  br label %bb3

bb2: ; preds = %.lr.ph.i.i

  %tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1

  br label %bb3

bb3: ; preds = %bb2, %bb1

  %phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]

  %tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1

  %tmp25 = load i32* %tmp24, align 4

  ret i32 %tmp25

}

The @test1 function cannot have the optimization applied because the
%tmp10 and %tmp20 GEPs vary by a field index into the struct, whereas
@test2 can be optimized to:

define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
  br i1 %tmp4, label %bb1, label %bb2

bb1: ; preds = %bb
  br label %bb3

bb2: ; preds = %bb
  br label %bb3

bb3: ; preds = %bb2, %bb1
  %0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ]
  %tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32 %0,
i32 1
  %tmp25 = load i32* %tmp24, align 4
  ret i32 %tmp25
}

So, here is why relaxing this rule is hard:

Without a constant index, we don't know what sub-struct to use for
interpreting the next index in the event of heterogeneity. Without this, we
can't do anything, and so we definitionally preclude the transform.

While clearly we can make this transform safely for homogenous structs,
doing so seriously weakens the IR's guarantees. I'd not like to see us go
that direction.

Instead, if this transformation is indeed important (and it sounds like it
is), I have a somewhat crazier idea: fold homogeneous struct type layers
into arrays. Thoughts? (I've not thought about this much, there might be
dragons here...)

From: "Chandler Carruth" <chandlerc@google.com>
To: "Louis Gerbarg" <lgg@apple.com>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Thursday, May 22, 2014 7:09:49 PM
Subject: Re: [LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer

Hmm… you are correct, it turns out you are correct, I only have to
worry if it is constant after the second arg of the GEP. That should
allow my case to continue to work. I am currently building a stage2
now to see if works cleanly.

Cool. Hopefully we can make incremental progress on the problem while
figuring out the right thing to do in the fully general IR case...

Having said that, there still exist cases where you are indexing into
a homogenous struct like so. Consider the following two functions,
which are the same except one is written using structs and the other
is written using arrays:

Ah, delightful. I was pretty sure there was a more complex example
that still exhibited the problem, but wanted to see it to think
through things properly.

%struct1 = type { i32, i32 }
%struct2 = type { %struct1, %struct1 }
; Function Attrs: ssp uwtable
define i32 @test1(%struct2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2
bb1: ; preds = %bb5
%tmp10 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 0
br label %bb3

bb2: ; preds = %.lr.ph.i.i
%tmp20 = getelementptr inbounds %struct2* %dm, i64 %tmp9, i32 1
br label %bb3

bb3: ; preds = %bb2, %bb1
%phi = phi %struct1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %struct1* %phi, i64 0, i32 1

%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

%array1 = type [2 x i32]
%array2 = type [2 x %array1]

; Function Attrs: ssp uwtable
define i32 @test2(%array2* %dm, i1 %tmp4, i64 %tmp9, i64 %tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2
bb1: ; preds = %bb5
%tmp10 = getelementptr inbounds %array2* %dm, i64 %tmp9, i32 0
br label %bb3

bb2: ; preds = %.lr.ph.i.i
%tmp20 = getelementptr inbounds %array2* %dm, i64 %tmp19, i32 1
br label %bb3

bb3: ; preds = %bb2, %bb1
%phi = phi %array1* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %array1* %phi, i64 0, i32 1

%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

The @test1 function cannot have the optimization applied because the
%tmp10 and %tmp20 GEPs vary by a field index into the struct,
whereas @test2 can be optimized to:

define i32 @test2([2 x [2 x i32]]* %dm, i1 %tmp4, i64 %tmp9, i64
%tmp19) {
bb:
br i1 %tmp4, label %bb1, label %bb2

bb1: ; preds = %bb
br label %bb3

bb2: ; preds = %bb
br label %bb3

bb3: ; preds = %bb2, %bb1
%0 = phi i32 [ 0, %bb1 ], [ 1, %bb2 ]
%tmp24 = getelementptr inbounds [2 x [2 x i32]]* %dm, i64 %tmp9, i32
%0, i32 1

%tmp25 = load i32* %tmp24, align 4
ret i32 %tmp25
}

So, here is why relaxing this rule is hard:

Without a constant index, we don't know what sub-struct to use for
interpreting the next index in the event of heterogeneity. Without
this, we can't do anything, and so we definitionally preclude the
transform.

While clearly we can make this transform safely for homogenous
structs, doing so seriously weakens the IR's guarantees. I'd not
like to see us go that direction.

To what guarantees are you referring?

Thanks again,
Hal

That we can (verifier) reject arbitrarily invalid sequences of indices in
GEPs into structs because we know which struct we're GEPing into and what
potential indices would be valid. This in turn simplifies the interfaces of
code dealing with struct GEP indices.

Ah, okay. Then what do you mean by, "fold homogeneous struct type layers into arrays."? Do you mean refactor to make them look like arrays at the API level?

-Hal

No, to change the types used to access this data to actually use an array
(rather than a homogeneous struct) at the layer where it is possible to
dynamically index without issue.

Hal rightly pointed out (after some denseness on my part) that I had
completely misread these two paragraphs. Sorry about that. Got distracted
by understanding your example, and missed that you had already done the
same analysis I had and even come to the exact same conclusion.

Well, there are a couple of pragmatic solutions that I could apply. I
could handle it as a peephole on x86, or do some bit casts in the IR.
Before I go down that path I wanted to ask a more interesting question: Are
we missing other optimization opportunities because we are using structs
for homogenous data that could be expressed in terms of arrays? If we were
to write an early pass that effectively rewrote structs into arrays would
we see any improvements (besides the one I mentioned in this email). There
are a couple of different ways of doing it:

{i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling
is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is
almost and good and potentially eliminates certain edge cases. How about
heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16}
as {[2 x i32], i16} allow for better optimizations?

I think that *if* this makes sense, it makes sense to phrase the transform
as folding a run of the same type in a struct into an array of that type.
That is, we should never remove a layer of {}s from a type because that
might have alignment implications or other implications I've not thought
of. But within a {}, I think it makes a *lot* of sense to have a single,
canonical way to represent a sequence of N types, and [N x <type>] seems
like the obvious representation, much more so than repeating the type N
times.

I think at least an early pass makes a lot of sense here. I wonder whether
it makes sense to do something even more radical such as having the struct
type for '{i32, i32, i32}' *be* the same type as '{[N x i32]}' for all N !=
1. I think it would be good to do structural canonicalization of this form
quite early on the structural LLVM types, but I haven't even begun to think
about how much code that breaks.

I would suggest implementing your GEP optimization conservatively and
correctly, committing it, and then resuming this thread with some numbers
regarding what performance improvements can be had by doing this kind of
structural canonicalization?

No worries. If anything the fact that you effectively came to the same conclusion in this way is confirmation that the idea is worth exploring.

That seems reasonable to me.

Yeah. If we decided to progress down that route I think it probably make sense to start with homogenous structs as the GEP rewrites and bookkeeping are easier, and then move on from there if we see benefits.

That sounds good. I had hoped to get that done today, but between various other things coming up it might be sometime this weekend/early next week.

Louis