[LLD] Slow callstacks in gdb

Hi,

for programs linked with lld it's substantially slower to get callstacks
in gdb, in comparison to gold-linked programs. Two measurements:

lld      gold
15 sec   3 sec
6 sec    2 sec

This is a debug build, rather large binaries (lots of templates). I have
seen even worse performance for debug+UBSan builds. I think code size (and
therefore DWARF size) has an impact. Is there some information missing
that gdb needs, and only gold generates?

gdb version is 8. I tested lld 5.0 and an earlier 4.0 trunk version.

Note that these binaries do not use gdb indexing.

Has anyone seen something similar?

Best regards,
Martin

I do not know what is going on with your binary, but I’d first inspect section sizes. Can you run readelf --sections against the two executables to see if there’s a significant difference in section sizes?

Are both using --gdb-index? Can you try lld trunk if so?

Is any of the programs you tested open source?

Cheers,
Rafael

Is the program being built by gcc or clang?

Cheers,
Rafael

Rafael Avila de Espindola <rafael.espindola@gmail.com> writes:

Hi,

> Is the program being built by gcc or clang?

gcc 6, but I can try clang.

> Are both using --gdb-index? Can you try lld trunk if so?

No.

> Is any of the programs you tested open source?

No.

Here is some more info. I enabled "set verbose on" in gdb. With this
setting, I'm getting DWARF errors in both versions when executing "bt"
(linked by gold and lld), but the lld version has significantly more
errors. These only appear in the lld version:

(gdb) bt
Reading in symbols for abc.cpp....
debug_line address at offset 0x575128 is 0 [in module abc.so]....
debug_line address at offset 0x575141 is 0 [in module abc.so]....
cannot get low and high bounds for subprogram DIE at 52844002...
cannot get low and high bounds for subprogram DIE at 52844042...
cannot get low and high bounds for subprogram DIE at 52844092...
const value length mismatch for 'std::ios_constants::boolalpha', got 4, expected 0
debug info gives command-line macro definition with non-zero line 13: __MACRO
macro '__USE_ISOC95' is #undefined twice
macro `__STDC_LIMIT_MACROS' redefined at...
Member function "~basic_istream" (offset 28792309) is virtual but the vtable offset is not specified...

Maybe gdb needs to fall back to slower line number resolution because e.g.
low and high bounds cannot be retrieved and debug_line_address is 0?

Best regards,
Martin

It is hard to know without a reproducer. I tried gdb on clang itself,
built with both clang and gcc and linked with both gold and lld. I could
not reproduce the slowdown, but I was using trunk lld.

Cheers,
Rafael

I will retry with clang trunk, when it reproduces I will build some other
large project (that has DSOs) using our compile/link options (they are not
that special, I think).

Best regards,
Martin

If you can try lld trunk too that would be awesome.

Cheers,
Rafael

Rafael Avila de Espindola wrote:

>> I will retry with clang trunk, when it reproduces I will build some other
>> large project (that has DSOs) using our compile/link options (they are not
>> that special, I think).
>
> If you can try lld trunk too that would be awesome.

I meant lld trunk :)

The problem goes away when building with clang 4.0 and linking with lld (a
version from trunk maybe 6 weeks before the 5.0.0 release). However the
compile options are different in that case, e.g. with respect to -Ox and
-gx, so it's perhaps a bit apples to oranges.

When building with gcc 6.2.1 and linking with lld trunk, I get a link error:

bin-lld/ld: error: lib/libse.a(file1.cpp.o): unaligned data

What would be helpful to diagnose this?

Best regards,
Martin

That means that file1.cpp.o has an invalid sh_offset. Can you post a
readelf -SW of it? How is it being created?

The error is from ELF.h: ELFFile<ELFT>::getSectionContentsAsArray.

Cheers,
Rafael

The output looks as follows [1]. It seems sh_offset is missing?

The object file was created with ICC (not sure which version).

Best regards,
Martin

[1]
There are 25 section headers, starting at offset 0xd5e50:

Section Headers:
  [Nr] Name                   Type     Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                        NULL     0000000000000000 000000 000000 00      0   0  0
  [ 1] .symtab                SYMTAB   0000000000000000 000040 000228 18     16  17  4
  [ 2] .data                  PROGBITS 0000000000000000 000268 000000 00  WA  0   0  4
  [ 3] .bss                   NOBITS   0000000000000000 000268 000000 00  WA  0   0  4
  [ 4] .text                  PROGBITS 0000000000000000 000268 010330 00  AX  0   0 16
  [ 5] .rodata                PROGBITS 0000000000000000 010598 001460 00   A  0   0 32
  [ 6] .debug_opt_report      PROGBITS 0000000000000000 0119f8 002d52 00      0   0  1
  [ 7] .note.GNU-stack        NOTE     0000000000000000 01474a 000000 00      0   0  1
  [ 8] .debug_info            PROGBITS 0000000000000000 01474a 020d66 00      0   0  1
  [ 9] .debug_line            PROGBITS 0000000000000000 0354b0 00968b 00      0   0  1
  [10] .debug_abbrev          PROGBITS 0000000000000000 03eb3b 0003ea 00      0   0  1
  [11] .debug_frame           PROGBITS 0000000000000000 03ef25 001220 00      0   0  1
  [12] .debug_str             PROGBITS 0000000000000000 040145 018b94 01  MS  0   0  1
  [13] .eh_frame              PROGBITS 0000000000000000 058cd9 001220 00   A  0   0  8
  [14] .debug_ranges          PROGBITS 0000000000000000 059ef9 016000 00      0   0  1
  [15] .gnu.linkonce.d.DW.ref.__gxx_personality_v0 PROGBITS 0000000000000000 06fef9 000008 00   A  0   0  1
  [16] .strtab                STRTAB   0000000000000000 06ff01 001522 00      0   0  1
  [17] .rela.text             RELA     0000000000000000 071423 001728 18      1   4  8
  [18] .rela.debug_opt_report RELA     0000000000000000 072b4b 001fb0 18      1   6  8
  [19] .rela.debug_info       RELA     0000000000000000 074afb 028680 18      1   8  8
  [20] .rela.debug_line       RELA     0000000000000000 09d17b 000048 18      1   9  8
  [21] .rela.debug_frame      RELA     0000000000000000 09d1c3 000090 18      1  11  8
  [22] .rela.eh_frame         RELA     0000000000000000 09d253 000060 18      1  13  8
  [23] .rela.debug_ranges     RELA     0000000000000000 09d2b3 038b80 18      1  14  8
  [24] .rela.gnu.linkonce.d.DW.ref.__gxx_personality_v0 RELA     0000000000000000 0d5e33 000018 18      1  15  8
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

Martin Richtarsky <s@martinien.de> writes:

Output looks as follows [1] Seems sh_offset is missing?

That is what readelf prints as Off.

  [17] .rela.text             RELA     0000000000000000 071423 001728 18      1   4  8

The offset of .rela.text should have been aligned, but it is not. Can you
report a bug against ICC? As a workaround, using the GNU assembler, if
possible, should fix this.

Cheers,
Rafael
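
For illustration, the kind of alignment test that rejects this file can be sketched as below. This is a minimal sketch and not lld's actual code: the helper name isOffsetAligned is hypothetical, and the offsets are copied from the readelf output in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not lld's implementation): true when a file offset
// is a multiple of the required alignment, which must be a power of two.
constexpr bool isOffsetAligned(std::uint64_t offset, std::uint64_t align) {
  return (offset & (align - 1)) == 0;
}

// Sanity checks using offsets from the readelf listing in this thread.
inline void checkOffsetsFromThread() {
  assert(!isOffsetAligned(0x071423, 2)); // .rela.text: odd offset
  assert(!isOffsetAligned(0x071423, 8)); // also misses its own Al of 8
  assert(isOffsetAligned(0x000040, 4));  // .symtab: fine
}
```

An odd offset such as 0x071423 fails even the weakest non-trivial requirement of 2 bytes, which is consistent with the "unaligned data" error reported above.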

Yeah this is a violation of the spec and must be a bug in ICC. That being
said, is there a practical benefit of checking the validity of the
alignment, except finding buggy object files early? I mean, if an object
file is in a static archive, all "aligned" data in the object file might
not be aligned against the beginning of the archive file.
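
As a worked example of the archive concern: assuming (as in the common ar format) that archive members start on even offsets, an offset that is 8-byte aligned inside the object file may be only 2-byte aligned relative to the start of the archive. The member start 0x3e below is a made-up value and absoluteOffset is a hypothetical helper; only the .symtab offset 0x40 comes from the readelf output in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: offset of section data measured from the start of
// the archive, given where the member's contents begin in the archive.
constexpr std::uint64_t absoluteOffset(std::uint64_t memberStart,
                                       std::uint64_t shOffset) {
  return memberStart + shOffset;
}

inline void checkArchiveExample() {
  // 0x40 is a multiple of 8 within the object file itself.
  assert(0x40 % 8 == 0);
  // Placed at an even but otherwise arbitrary member start (assumed 0x3e),
  // the same data is no longer 8-byte aligned in the archive...
  assert(absoluteOffset(0x3e, 0x40) % 8 != 0);
  // ...but it is still 2-byte aligned.
  assert(absoluteOffset(0x3e, 0x40) % 2 == 0);
}
```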

It will at least be aligned to two bytes.

On most current host architectures, handling
packed_endian_specific_integral is fairly efficient. For example, on
x86_64, reading 32 bits with 1-, 2-, and 4-byte alignment produces the
same code in all cases:

  movl (%rdi), %eax

But on armv6 the aligned case is

  ldr r0, [r0]

the 2 byte aligned case is

  ldrh r1, [r0, #2]
  ldrh r0, [r0]
  orr r0, r0, r1, lsl #16

and the unaligned case is

  ldrb r1, [r0]
  ldrb r2, [r0, #1]
  ldrb r3, [r0, #2]
  ldrb r0, [r0, #3]
  orr r1, r1, r2, lsl #8
  orr r0, r3, r0, lsl #8
  orr r0, r1, r0, lsl #16

On armv7 it is a single ldr in all cases.

Now, I don't really know how much we support *host* architectures
without an unaligned load instruction. If we don't care about making lld
and other llvm tools slower on those host architectures, we could use
packed_endian_specific_integral with an alignment of 1 and remove the
check. I guess we would have to ask on llvm-dev before changing that.

Cheers,
Rafael
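
To make the trade-off concrete, here is a minimal C++ sketch of a 32-bit little-endian read under two alignment guarantees. It is not LLVM's packed_endian_specific_integral; the function names are hypothetical, and the reads are assembled from bytes so the result does not depend on host endianness. A compiler that knows the pointer's alignment is free to lower these to the shorter instruction sequences shown above.

```cpp
#include <cassert>
#include <cstdint>

// 16-bit little-endian read assembled from bytes. With a known 2-byte
// alignment a compiler may turn this into a single halfword load.
inline std::uint16_t read16le(const unsigned char *p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Alignment-1 case: four byte loads combined with shifts, comparable to
// the ldrb/orr sequence for the fully unaligned armv6 case.
inline std::uint32_t read32leBytes(const unsigned char *p) {
  return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
         (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Alignment-2 case: two halfwords combined, comparable to the
// ldrh/ldrh/orr sequence for the 2-byte aligned armv6 case.
inline std::uint32_t read32leHalves(const unsigned char *p) {
  std::uint32_t lo = read16le(p);
  std::uint32_t hi = read16le(p + 2);
  return lo | (hi << 16);
}

inline void checkReads() {
  // The value 0x12345678 stored little-endian at an odd offset (1).
  const unsigned char odd[] = {0xAA, 0x78, 0x56, 0x34, 0x12};
  assert(read32leBytes(odd + 1) == 0x12345678u);
  // The same value stored at an even offset (2) of a 2-byte aligned buffer.
  alignas(2) const unsigned char even[] = {0, 0, 0x78, 0x56, 0x34, 0x12};
  assert(read32leHalves(even + 2) == 0x12345678u);
}
```

Both functions return the same value for the same bytes; the difference is only in what alignment the caller has to guarantee for the cheaper code sequence to be valid.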

Somewhat orthogonal to the original issue, but if object files are aligned only to two bytes in a static archive, and if we are using the four byte aligned load instruction on armv6 to load data from object files, that means current LLVM can easily cause a bus error on armv6, no?

We are not using a 4 byte aligned load. We are using:

the 2 byte aligned case is

  ldrh r1, [r0, #2]
  ldrh r0, [r0]
  orr r0, r0, r1, lsl #16

That is why the check for the section being at least 2 byte aligned is
important.

The 2 is from

  using Elf_Word = support::detail::packed_endian_specific_integral<
      uint32_t, target_endianness, 2>;

Cheers,
Rafael

Got it. Thanks. I’d vote for just reporting the issue to Intel and keeping our existing behavior.