[ARM] [PIC] optimizing the loading of hidden global variable

Hi,

When I compile code with -fvisibility=hidden -fPIC for ARM, I find that LLVM generates less optimized code than GCC.

For example:

test.cpp:

void init(void *);

int g0[100];
int g1[100];
int g2[100];

void foo() {
  init(&g0);
  init(&g1);
  init(&g2);
}

Clang will emit 1 GOT entry for each GV and 2 instructions to get the address:

  ldr r0, .LCPI0_2
  add r0, r0, r4
  bl _Z4initPv(PLT)

GCC does this only for the first GV. The addresses of the remaining GVs are computed directly from it:

  ldr r4, .L2
.LPIC0:
  add r4, pc, r4        @ get &g0 via GOT_PC relative
  mov r0, r4
  bl _Z4initPv(PLT)
  add r0, r4, #400      @ get &g1
  bl _Z4initPv(PLT)
  add r0, r4, #800      @ get &g2
  ldmfd sp!, {r4, lr}
  b _Z4initPv(PLT)
.L3:
  .align 2
.L2:
  .word .LANCHOR0-(.LPIC0+8)   @ 1 GOT offset entry
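
For readers wondering where the #400 and #800 immediates come from, here is a minimal sketch. It assumes the three arrays sit back to back behind a single anchor, which is what the GCC output suggests. The `Anchor` struct and the `kOffset*` names are hypothetical and only model the layout; `int32_t` stands in for the 4-byte `int` on 32-bit ARM.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the block GCC labels .LANCHOR0:
// the three arrays laid out contiguously, in declaration order.
struct Anchor {
    std::int32_t g0[100];
    std::int32_t g1[100];
    std::int32_t g2[100];
};

// Byte offsets of each array from the anchor address. With one base
// register holding &g0, the other two addresses are base + offset.
constexpr std::size_t kOffsetG0 = offsetof(Anchor, g0);
constexpr std::size_t kOffsetG1 = offsetof(Anchor, g1);
constexpr std::size_t kOffsetG2 = offsetof(Anchor, g2);

static_assert(kOffsetG0 == 0, "g0 is the anchor itself");
static_assert(kOffsetG1 == 100 * sizeof(std::int32_t), "g1 follows g0");
static_assert(kOffsetG2 == 200 * sizeof(std::int32_t), "g2 follows g1");
```

Under that layout, one literal-pool load plus one add yields &g0, and each further address costs a single add.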

This looks like a missed optimization opportunity in LLVM, for both code size and performance. Any ideas? If so, I can open a bug and try to fix it.

Thanks,

Weiming

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

Hi Weiming,

Hi Tim,

Thanks for the pointer. It seems GlobalMerge only considers static/local GVs:
  if (!I->hasLocalLinkage() || I->isThreadLocal() || I->hasSection())
      continue;
Let me try some experiments in GlobalMerge.

Another place might be in ARMISelLowering.cpp :: LowerGlobalAddressELF(), but I think changing GlobalMerge makes more sense.

Thanks,
Weiming

Hi Tim,

The global merge pass puts the GVs into a structure to guarantee that their
addresses are contiguous.
That works for static GVs, but for global hidden GVs it causes name
resolution to fail when linking the .o files into a .so.

Any thoughts?

Thanks,
Weiming

Ah, I see. I've just looked up the semantics and GlobalMerge probably
won't work, I agree.

I'm now struggling to see how GCC justifies it. What if a different
translation-unit declared those variables in a different order? I also
can't get the same behaviour here, do you have a more complete
command-line?

Cheers.

Tim.

Ah, I see; the translation-unit that does the optimisation needs to
have them as a definition (i.e. "= {0}") rather than a declaration for
the optimisation to kick in, giving it precedence over other
declarations. And the hidden-visibility means they won't be
R_ARM_COPYed out of their initial location.

After a very brief thought, I'd still go for GlobalMerge now, in
conjunction with an enhanced "alias" so that you could emit something
like:

    @g1 = hidden alias [100 x i32]* bitcast(i32* getelementptr([300 x i32]* @Merged, i32 0, i32 0) to [100 x i32]*)

We certainly don't seem to handle this alias properly now though, and
it may violate the intended uses. Rafael's doing some thinking about
"alias" at the moment, so I've CCed him.

Would that be a horrific abuse of the poor alias system?

Cheers.

Tim.

I think it would :-) Folding objects like this prevents the linker
from deleting one of them if it is unused, for example.

I think it is just a missing optimization in the ARM backend. If it
knows multiple objects are in the same DSO, it can use the address of
one to find the other.

Given:

@g0 = hidden global [100 x i32] zeroinitializer, align 4
@g1 = hidden global [100 x i32] zeroinitializer, align 4
define void @foo() {
  tail call void @bar(i8* bitcast ([100 x i32]* @g0 to i8*))
  tail call void @bar(i8* bitcast ([100 x i32]* @g1 to i8*))
  ret void
}
declare void @bar(i8*)

The command "llc -mtriple=i686-pc-linux -relocation-model=pic" produces

calll .L0$pb
.L0$pb:
popl %ebx
.Ltmp3:
addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp3-.L0$pb), %ebx
leal g0@GOTOFF(%ebx), %eax
movl %eax, (%esp)
calll bar@PLT
leal g1@GOTOFF(%ebx), %eax
movl %eax, (%esp)
calll bar@PLT

Which is OK, since the add of ebx is folded and the constant is an
immediate in x86.

On ARM, that is not the case. We produce

        ldr r0, .LCPI0_0
        add r4, pc, r0   // r4 is the equivalent of ebx in the x86 case.
        ldr r0, .LCPI0_1 // r0 is the constant that is an immediate in x86.
        add r0, r0, r4   // that is the add that is folded in x86
...
.LCPI0_0:
        .long _GLOBAL_OFFSET_TABLE_-(.LPC0_0+8)
.LCPI0_1:
        .long g0(GOTOFF)

For ARM, codegen already keeps track of offsets so that it can implement
constant islands, so it should be able to see that the two globals
are close enough that the offset between them fits in an immediate.
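
To make "fits an immediate" concrete, here is a rough sketch of the check, assuming ARM (A32) mode as in the listings above. There, a data-processing immediate is an 8-bit value rotated right by an even amount. The function names are hypothetical and this is not LLVM's own helper; Thumb encodings follow different rules.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by r bits, for r in 0..31.
inline std::uint32_t rotl32(std::uint32_t v, unsigned r) {
    return (v << r) | (v >> ((32u - r) & 31u));
}

// True if v can be written as an 8-bit value rotated right by an even
// amount (0, 2, ..., 30). Rotating v left by the same amount undoes the
// rotation, so we look for an even rotation that leaves only the low
// 8 bits set.
inline bool fitsArmImmediate(std::uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}
```

With this check, the byte offsets 400 and 800 between the three arrays in the example are both encodable, while a distance such as 0x101 is not and would still need a literal-pool load or an extra instruction.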

Nick, will this work on MachO or can ld64 move _g0, _g1 and _g2 too far apart?

BTW, what will gcc produce for

void init(void *);
extern int g0[100] __attribute__((visibility("hidden")));
extern int g1[100] __attribute__((visibility("hidden")));
extern int g2[100] __attribute__((visibility("hidden")));
void foo() {
  init(&g0);
  init(&g1);
  init(&g2);
}

Cheers,
Rafael

When this is compiled, you only know that g0, g1, and g2 will be in the same
linkage unit. You don’t know if they will come from the same translation unit.
You don’t know how big the overall __DATA segment will be, so yes, it is
quite possible that g0, g1, and g2 will be more than 64KB apart.

Also, 32-bit ARM for mach-o does not use a GOT. The compiler does create
GOT-like slots called non-lazy-pointers for accessing symbols defined outside
the translation unit. Given that the arrays are declared hidden, they
will be defined in the linkage unit. So, ideally, the non-lazy-pointer indirection
could be removed and the code could access the arrays directly. The problem
is that mach-o has no relocation for pointer diffs where one of the pointers is
undefined. Currently, the only way to get the optimal code (not using a
non-lazy-pointer) is to build with LTO.

-Nick

Hi Rafael,

Yes, merging GVs prevents the linker from doing garbage collection. Should it be implemented as a peephole pass? If we do it too early, the distances between the GVs are not fixed yet.

PS:
Below is the GCC output with "extern" hidden:
  ldr r2, .L2
  stmfd sp!, {r3, lr}
  .save {r3, lr}
.LPIC0:
  add r0, pc, r2
  bl _Z4initPv(PLT)
  ldr r1, .L2+4
.LPIC1:
  add r0, pc, r1
  bl _Z4initPv(PLT)
  ldr r0, .L2+8
.LPIC2:
  add r0, pc, r0
  ldmfd sp!, {r3, lr}
  b _Z4initPv(PLT)
.L3:
  .align 2
.L2:
  .word g0-(.LPIC0+8)
  .word g1-(.LPIC1+8)
  .word g2-(.LPIC2+8)
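
As a side note on the `.word gN-(.LPICn+8)` entries: in ARM (A32) state, reading pc yields the address of the current instruction plus 8, so each literal is pre-biased by 8 and the `add r0, pc, rN` at `.LPICn` lands exactly on the symbol. A small sketch of that arithmetic follows; the function names are hypothetical, and any addresses passed in are purely illustrative.

```cpp
#include <cstdint>

// Value observed when an A32 instruction at insnAddr reads pc.
inline std::uint32_t pcReadAt(std::uint32_t insnAddr) {
    return insnAddr + 8u;
}

// Literal-pool word for ".word sym-(.LPICn+8)", where picLabelAddr is
// the address of the add instruction labelled .LPICn.
inline std::uint32_t literalFor(std::uint32_t symAddr,
                                std::uint32_t picLabelAddr) {
    return symAddr - (picLabelAddr + 8u);
}

// Result of "add r0, pc, rN" placed at picLabelAddr with rN == literal.
// Unsigned wraparound makes this work for symbols below the code too.
inline std::uint32_t resolve(std::uint32_t picLabelAddr,
                             std::uint32_t literal) {
    return pcReadAt(picLabelAddr) + literal;
}
```

Each extern variable needs its own literal and its own add here, because the compiler cannot assume anything about the distance between symbols it does not define.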

Thanks,
Weiming

Correct. It would be somewhere in CodeGen, I am not exactly sure where.

That is pretty neat too.

Cheers,
Rafael

I just gave GlobalMerge with aliases a try because I thought it would be easy to do. However, there is another issue with it.
Although I got aliases like:
@h0 = alias getelementptr inbounds (...@_MergedGlobals, 0, 0)
@h1 = alias getelementptr inbounds (...@_MergedGlobals, 0, 1)
@h2 = alias getelementptr inbounds (...@_MergedGlobals, 0, 2)

they cannot be lowered to correct asm. They all become aliases of _MergedGlobals itself:
  .globl h0
.set h0, _MergedGlobals
  .globl h1
.set h1, _MergedGlobals
  .globl h2
.set h2, _MergedGlobals

I guess there is no support in asm for aliasing a member of a struct, right?

Thanks,
Weiming

More or less. You can use a constantexpr, but we want to remove that
and replace it with an explicit offset. llvm.org/pr10367.

There could be bugs too, of course.

Cheers,
Rafael