[RFC] Changing .llvm.call-graph-profile to use relocations

Currently when .llvm.call-graph-profile is created by llvm it explicitly encodes the symbol indices. This section is basically a black box for post processing tools. For example, if we run strip -s on the object files the symbol table changes, but indices in that section do not.

We propose to change this section to use relocations. The Frequency will still be in the .llvm.call-graph-profile, but symbol information will be in relocation section. In LLD information from both sections is used to reconstruct call graph profile. Relocations themselves will never be applied.

With this approach post processing tools that handle relocations correctly work for this section also.

One thing is section is marked with SHF_EXCLUDE.

From spec
“This section is excluded from input to the link-edit of an executable or shared object. This flag is ignored if the SHF_ALLOC flag is also set, or if relocations exist against the section.”

So technically speaking it needs to be kept, and presumably relocations applied, but LLD follows gold and ld and discards sections marked as SHF_EXCLUDE even with relocations. So, I think this approach should be fine. https://reviews.llvm.org/D24966

Finally, this bug seems similar to https://sourceware.org/bugzilla/show_bug.cgi?id=23817. Proposed solution for that was also to use relocations.

Implementation: https://reviews.llvm.org/D103212

Currently when .llvm.call-graph-profile is created by llvm it explicitly encodes the symbol indices. This section is basically a black box for post processing tools. For example, if we run strip -s on the object files the symbol table changes, but indices in that section do not.

We propose to change this section to use relocations. The Frequency will still be in the .llvm.call-graph-profile, but symbol information will be in relocation section. In LLD information from both sections is used to reconstruct call graph profile. Relocations themselves will never be applied.

Yes, a relocation based approach will be more robust and can fix the
.llvm_addrsig issue 23817 – strip and ld -r corrupt SHT_LLVM_ADDRSIG section (files built with clang-7)

The relocation type doesn't matter. Your implementation uses R_X86_64_32 and from/to/value/from/to/value/from/to/value
An alternative design is R_X86_64_NONE + value/value/value, i.e. from/to do not occupy space in the content.
We will get a 3x space saving.

We need to change the section type since changing representation is incompatible
and the sections from old object files should be ignored.
The new section type will be ignored by old LLVM tools as well.

With this approach post processing tools that handle relocations correctly work for this section also.

One thing is section is marked with SHF_EXCLUDE.
From spec
"This section is excluded from input to the link-edit of an executable or shared object. This flag is ignored if the SHF_ALLOC flag is also set, or if relocations exist against the section."

So technically speaking it needs to be kept, and presumably relocations applied, but LLD follows gold and ld and discards sections marked as SHF_EXCLUDE even with relocations. So, I think this approach should be fine. ⚙ D24966 [ELF] Ignore sections flagged with SHF_EXCLUDE

This is Solaris's spec (since 1996), not standard ELF's. It can be advisory but
our behavior (mostly GNU ABI+Linux ABI+LLVM extensions) does not necessarily
follow it. I cannot find a definition in the x86-64 psABI or a GNU ABI
document.

I think the behavior as implemented in gold and LLD is more useful.

Thank you for feedback, let me try to use R_X86_64_NONE, and introduce another type. . I thought R_X86_64_32 would be less impactful as structure of is preserved, but it is space wasteful.
For type name: ELF::SHT_LLVM_CALL_GRAPH_PROFILE_RELA ?

Alex

Thank you for feedback, let me try to use R_X86_64_NONE, and introduce another type. . I thought R_X86_64_32 would be less impactful as structure of is preserved, but it is space wasteful.
For type name: ELF::SHT_LLVM_CALL_GRAPH_PROFILE_RELA ?

Alex

Preversing the structure is not needed because the symbol representation
is changed anyway.

You just need to change the value in
llvm/llvm/include/llvm/BinaryFormat/ELF.h
The name doesn't need to change.

My understanding is that both call graph info and addrsig info use symbol table indices because they are considerably smaller than relocations (4 bytes vs 3x pointers, or something). Won’t switching to relocations increase object file size?

I think the approach was inspired by some COFF sections, where this technique is used to mark safe exception handlers, address-taken functions, longjmp targets, and other things. MSVC doesn’t have a strip utility, so the only tool you need to handshake with is the linker.

I was thinking keeping both structures for backward compatibility in case object files with old representation are fed in to new llvm-objdump, and even lld. For example if someone will use older clang release with lld/llvm-objdump after this change.

For just changing the value and keeping the name. Looks like it will leave a gap in this sequential sequence. If a new flag is added later down the road and in this case latest clang used with older lld/llvm-objdump this will also present a problem as older tools will interpret new flag as SHT_LLVM_CALL_GRAPH_PROFILE.
Not sure if these are valid concerns, what do you think?
If we go for clean state, then maybe leave 0x6fff4c02 entry and change it to SHT_LLVM_CALL_GRAPH_PROFILE_DEPRICATED, and emit warning if objects with it are passed in.

SHT_LLVM_ODRTAB = 0x6fff4c00, // LLVM ODR table.
SHT_LLVM_LINKER_OPTIONS = 0x6fff4c01, // LLVM Linker Options.
SHT_LLVM_CALL_GRAPH_PROFILE = 0x6fff4c02, // LLVM Call Graph Profile.
SHT_LLVM_ADDRSIG = 0x6fff4c03, // List of address-significant symbols
// for safe ICF.
SHT_LLVM_DEPENDENT_LIBRARIES =
0x6fff4c04, // LLVM Dependent Library Specifiers.
SHT_LLVM_SYMPART = 0x6fff4c05, // Symbol partition specification.
SHT_LLVM_PART_EHDR = 0x6fff4c06, // ELF header for loadable partition.
SHT_LLVM_PART_PHDR = 0x6fff4c07, // Phdrs for loadable partition.
SHT_LLVM_BB_ADDR_MAP = 0x6fff4c08, // LLVM Basic Block Address Map.

I was thinking keeping both structures for backward compatibility in case object files with old representation are fed in to new llvm-objdump, and even lld. For example if someone will use older clang release with lld/llvm-objdump after this change.
For just changing the value and keeping the name. Looks like it will leave a gap in this sequential sequence. If a new flag is added later down the road and in this case latest clang used with older lld/llvm-objdump this will also present a problem as older tools will interpret new flag as SHT_LLVM_CALL_GRAPH_PROFILE.
Not sure if these are valid concerns, what do you think?
If we go for clean state, then maybe leave 0x6fff4c02 entry and change it to SHT_LLVM_CALL_GRAPH_PROFILE_DEPRICATED, and emit warning if objects with it are passed in.
SHT_LLVM_ODRTAB = 0x6fff4c00, // LLVM ODR table.
SHT_LLVM_LINKER_OPTIONS = 0x6fff4c01, // LLVM Linker Options.
SHT_LLVM_CALL_GRAPH_PROFILE = 0x6fff4c02, // LLVM Call Graph Profile.
SHT_LLVM_ADDRSIG = 0x6fff4c03, // List of address-significant symbols
                                // for safe ICF.
SHT_LLVM_DEPENDENT_LIBRARIES =
     0x6fff4c04, // LLVM Dependent Library Specifiers.
SHT_LLVM_SYMPART = 0x6fff4c05, // Symbol partition specification.
SHT_LLVM_PART_EHDR = 0x6fff4c06, // ELF header for loadable partition.
SHT_LLVM_PART_PHDR = 0x6fff4c07, // Phdrs for loadable partition.
SHT_LLVM_BB_ADDR_MAP = 0x6fff4c08, // LLVM Basic Block Address Map.

Do you have measurement how well SHT_LLVM_CALL_GRAPH_PROFILE optimizes?
My understanding is that with ThinLTO+PGO it is has very tiny benefit.

The value reading old format is low. I don't expect users want to mix
newer object files with old object files and want to have optimization
from the old object files. SHT_LLVM_CALL_GRAPH_PROFILE_DEPRICATED is
just adding maintenance cost.

Changing the value but retaining the name is sufficient for the various
compatibility issues.

Reid's size concern is valid. Have you measured the size overhead?
We can use the SHT_REL format for SHT_LLVM_CALL_GRAPH_PROFILE
relocations, which brings down the per entry cost from 24 bytes to 16
bytes for ELFCLASS64. (It is a bit unfortunate that the Linux kernel
does not support ELFCLASS32 executables on a 64-bit architecture.
For most usage we are using the small code model and don't benefit
from 64-bit addresses/sizes.)

Replies inlined.

Do you have measurement how well SHT_LLVM_CALL_GRAPH_PROFILE optimizes?
My understanding is that with ThinLTO+PGO it is has very tiny benefit.

With PLO(BOLT) the benefit is zero, but with PGO and without PLO, it was still visible. We’ve seen ~1% RPS improvement in the past when measuring mysql.

PGO does not cover profile guided function layout, either the linker or PLO need to take care of that part. We’ve been using standalone HFSort on top of PGO, and we wanted to fold that into either PLO or linker – the profile guided function layout exists in lld, but we ran into bugs hence the RFC and Alexander’s fix.

Thanks,

Wenlei

Created a new diff with the _NONE implementation:

https://reviews.llvm.org/D104080

Created a new diff with the _NONE implementation:
⚙ D104080 [LLD][LLVM] CG Graph profile using relocations
________________________________
From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of Alexander Yermolovich via llvm-dev <llvm-dev@lists.llvm.org>
Sent: Thursday, May 27, 2021 5:44 PM
To: maskray@google.com <maskray@google.com>
Cc: llvm-dev@lists.llvm.org <llvm-dev@lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] Changing .llvm.call-graph-profile to use relocations

Replies inlined.

________________________________
From: maskray@google.com <maskray@google.com>
Sent: Thursday, May 27, 2021 5:23 PM
To: Alexander Yermolovich <ayermolo@fb.com>
Cc: llvm-dev@lists.llvm.org <llvm-dev@lists.llvm.org>; Wenlei He <wenlei@fb.com>; Modi Mo <modimo@fb.com>; dblaikie@gmail.com <dblaikie@gmail.com>
Subject: Re: [RFC] Changing .llvm.call-graph-profile to use relocations

>I was thinking keeping both structures for backward compatibility in case object files with old representation are fed in to new llvm-objdump, and even lld. For example if someone will use older clang release with lld/llvm-objdump after this change.
>For just changing the value and keeping the name. Looks like it will leave a gap in this sequential sequence. If a new flag is added later down the road and in this case latest clang used with older lld/llvm-objdump this will also present a problem as older tools will interpret new flag as SHT_LLVM_CALL_GRAPH_PROFILE.
>Not sure if these are valid concerns, what do you think?
>If we go for clean state, then maybe leave 0x6fff4c02 entry and change it to SHT_LLVM_CALL_GRAPH_PROFILE_DEPRICATED, and emit warning if objects with it are passed in.
> SHT_LLVM_ODRTAB = 0x6fff4c00, // LLVM ODR table.
> SHT_LLVM_LINKER_OPTIONS = 0x6fff4c01, // LLVM Linker Options.
> SHT_LLVM_CALL_GRAPH_PROFILE = 0x6fff4c02, // LLVM Call Graph Profile.
> SHT_LLVM_ADDRSIG = 0x6fff4c03, // List of address-significant symbols
> // for safe ICF.
> SHT_LLVM_DEPENDENT_LIBRARIES =
> 0x6fff4c04, // LLVM Dependent Library Specifiers.
> SHT_LLVM_SYMPART = 0x6fff4c05, // Symbol partition specification.
> SHT_LLVM_PART_EHDR = 0x6fff4c06, // ELF header for loadable partition.
> SHT_LLVM_PART_PHDR = 0x6fff4c07, // Phdrs for loadable partition.
> SHT_LLVM_BB_ADDR_MAP = 0x6fff4c08, // LLVM Basic Block Address Map.

Do you have measurement how well SHT_LLVM_CALL_GRAPH_PROFILE optimizes?
My understanding is that with ThinLTO+PGO it is has very tiny benefit.
[Alex] I do not. Wenlei mentioned he can dig up some numbers.

The value reading old format is low. I don't expect users want to mix
newer object files with old object files and want to have optimization
from the old object files. SHT_LLVM_CALL_GRAPH_PROFILE_DEPRICATED is
just adding maintenance cost.

Changing the value but retaining the name is sufficient for the various
compatibility issues.

Reid's size concern is valid. Have you measured the size overhead?
[Alex] Good point. I'll measure once I implement new approach with SHT_REL + R_X86_64_NONE.

We can use the SHT_REL format for SHT_LLVM_CALL_GRAPH_PROFILE
relocations, which brings down the per entry cost from 24 bytes to 16
bytes for ELFCLASS64. (It is a bit unfortunate that the Linux kernel
does not support ELFCLASS32 executables on a 64-bit architecture.
For most usage we are using the small code model and don't benefit
from 64-bit addresses/sizes.)

>________________________________
>From: maskray@google.com <maskray@google.com>
>Sent: Thursday, May 27, 2021 11:27 AM
>To: Alexander Yermolovich <ayermolo@fb.com>
>Cc: llvm-dev@lists.llvm.org <llvm-dev@lists.llvm.org>; Wenlei He <wenlei@fb.com>; Modi Mo <modimo@fb.com>; dblaikie@gmail.com <dblaikie@gmail.com>
>Subject: Re: [RFC] Changing .llvm.call-graph-profile to use relocations
>
>>Thank you for feedback, let me try to use R_X86_64_NONE, and introduce another type. . I thought R_X86_64_32 would be less impactful as structure of is preserved, but it is space wasteful.
>>For type name: ELF::SHT_LLVM_CALL_GRAPH_PROFILE_RELA ?
>>
>>Alex
>
>Preversing the structure is not needed because the symbol representation
>is changed anyway.
>
>You just need to change the value in
>llvm/llvm/include/llvm/BinaryFormat/ELF.h
>The name doesn't need to change.
>
>>From: maskray@google.com <maskray@google.com>
>>Sent: Wednesday, May 26, 2021 5:03 PM
>>To: Alexander Yermolovich <ayermolo@fb.com>
>>Cc: llvm-dev@lists.llvm.org <llvm-dev@lists.llvm.org>; Wenlei He <wenlei@fb.com>; Modi Mo <modimo@fb.com>; dblaikie@gmail.com <dblaikie@gmail.com>
>>Subject: Re: [RFC] Changing .llvm.call-graph-profile to use relocations
>>
>>>Currently when .llvm.call-graph-profile is created by llvm it explicitly encodes the symbol indices. This section is basically a black box for post processing tools. For example, if we run strip -s on the object files the symbol table changes, but indices in that section do not.
>>>
>>>We propose to change this section to use relocations. The Frequency will still be in the .llvm.call-graph-profile, but symbol information will be in relocation section. In LLD information from both sections is used to reconstruct call graph profile. Relocations themselves will never be applied.
>>
>>Yes, a relocation based approach will be more robust and can fix the
>>.llvm_addrsig issue 23817 – strip and ld -r corrupt SHT_LLVM_ADDRSIG section (files built with clang-7)
>>
>>The relocation type doesn't matter. Your implementation uses R_X86_64_32 and from/to/value/from/to/value/from/to/value
>>An alternative design is R_X86_64_NONE + value/value/value, i.e. from/to do not occupy space in the content.
>>We will get a 3x space saving.
>>
>>We need to change the section type since changing representation is incompatible
>>and the sections from old object files should be ignored.
>>The new section type will be ignored by old LLVM tools as well.
>>
>>>With this approach post processing tools that handle relocations correctly work for this section also.
>>>
>>>One thing is section is marked with SHF_EXCLUDE.
>>>From spec
>>>"This section is excluded from input to the link-edit of an executable or shared object. This flag is ignored if the SHF_ALLOC flag is also set, or if relocations exist against the section."
>>>
>>>So technically speaking it needs to be kept, and presumably relocations applied, but LLD follows gold and ld and discards sections marked as SHF_EXCLUDE even with relocations. So, I think this approach should be fine. ⚙ D24966 [ELF] Ignore sections flagged with SHF_EXCLUDE
>>
>>This is Solaris's spec (since 1996), not standard ELF's. It can be advisory but
>>our behavior (mostly GNU ABI+Linux ABI+LLVM extensions) does not necessarily
>>follow it. I cannot find a definition in the x86-64 psABI or a GNU ABI
>>document.
>>
>>I think the behavior as implemented in gold and LLD is more useful.
>>
>>>Finally, this bug seems similar to 23817 – strip and ld -r corrupt SHT_LLVM_ADDRSIG section (files built with clang-7) . Proposed solution for that was also to use relocations.
>>>
>>>Implementation: ⚙ D103212 [WIP] Changing CG Profile to relocations based.

Circling back on this. ⚙ D104080 [LLD][LLVM] CG Graph profile using relocations mostly looks good.

.llvm.call-graph-profile mostly only affects -fprofile-use and
-fprofile-sample-use(-fauto-profile) produced output files.

From the measurement for clang-13 the size looks really matter.

Also note that for for instrumentation PGO, -fprofile-generate= object
files are much larger than -fprofile-use=, so -fprofile-use= slightly
size increases don't matter.