[llvm-mc] FreeBSD kernel module performance impact when upgrading clang

Hi,

I'm in the process of migrating from clang5 to clang10. Unfortunately clang10 introduced a negative performance impact. The cause is an increase of PLT entries from this patch (first released in clang7):

https://bugs.llvm.org/show_bug.cgi?id=36370
https://reviews.llvm.org/D43383

If I revert that clang patch locally, the additional PLT entries and the performance impact disappear.

This occurs in the context of FreeBSD kernel modules. Using the example code from this page:

Chapter 9. Writing FreeBSD Device Drivers | FreeBSD Documentation Portal

...I can explain what I'm seeing. Compare the objects generated by clang5 and clang10 for the example:

clang5:

  $ objdump -r skeleton.o
  
  skeleton.o: file format elf64-x86-64
  
  RELOCATION RECORDS FOR [.text]:
  OFFSET TYPE VALUE
  0000000000000019 R_X86_64_32S .rodata.str1.1+0x0000000000000015
  0000000000000024 R_X86_64_32S .rodata.str1.1+0x000000000000002b
  000000000000002b R_X86_64_PC32 uprintf-0x0000000000000004
  [...]

clang10:

  $ objdump -r skeleton.o
  
  skeleton.o: file format elf64-x86-64
  
  RELOCATION RECORDS FOR [.text]:
  OFFSET TYPE VALUE
  0000000000000017 R_X86_64_32S .rodata.str1.1+0x000000000000002b
  0000000000000020 R_X86_64_32S .rodata.str1.1+0x0000000000000015
  0000000000000029 R_X86_64_PLT32 uprintf-0x0000000000000004
  [...]

The relocation for the external uprintf call is changed from R_X86_64_PC32 to R_X86_64_PLT32.
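
To check the same difference across many objects, the relocation types in `objdump -r` output can be tallied. The following is only a minimal Python sketch: the helper name `count_relocation_types` is hypothetical, and the two sample strings are the relocation rows copied from the listings above.

```python
from collections import Counter


def count_relocation_types(objdump_text):
    """Tally relocation types found in `objdump -r` output."""
    counts = Counter()
    for line in objdump_text.splitlines():
        fields = line.split()
        # Relocation rows have the form "OFFSET TYPE VALUE"; TYPE starts with "R_".
        if len(fields) >= 3 and fields[1].startswith("R_"):
            counts[fields[1]] += 1
    return counts


# Relocation rows copied from the clang5 and clang10 listings.
CLANG5 = """\
0000000000000019 R_X86_64_32S .rodata.str1.1+0x0000000000000015
0000000000000024 R_X86_64_32S .rodata.str1.1+0x000000000000002b
000000000000002b R_X86_64_PC32 uprintf-0x0000000000000004
"""
CLANG10 = """\
0000000000000017 R_X86_64_32S .rodata.str1.1+0x000000000000002b
0000000000000020 R_X86_64_32S .rodata.str1.1+0x0000000000000015
0000000000000029 R_X86_64_PLT32 uprintf-0x0000000000000004
"""

for name, text in (("clang5", CLANG5), ("clang10", CLANG10)):
    print(name, dict(count_relocation_types(text)))
```

In practice the text would come from running `objdump -r` on each object file; header lines such as "OFFSET TYPE VALUE" are skipped because their second field does not start with "R_".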

Normally, amd64/x86 kernel modules are relocatable object files (via ld -r). Because of that, D43383 typically has no impact as the FreeBSD loader sees the relocations directly and treats R_X86_64_PC32 and R_X86_64_PLT32 the same:

https://github.com/freebsd/freebsd/blob/master/sys/amd64/amd64/elf_machdep.c#L321

But in my case, the kernel objects are created as shared objects. Using shared objects is atypical for amd64, but done for every other architecture except mips:

https://github.com/freebsd/freebsd/blob/master/sys/conf/kmod.mk#L81

The comments in the D43383 review suggest that a modern linker should reduce the PLT32 relocations to PC32 for local calls. But I do not see that reduction even when testing this and other examples with lld 10. My understanding is this is due to the shared kernel objects. The relocations are being processed (and left as PLT) prior to the kernel loader ever seeing them. Unfortunately this means many calls that previously did not go through the PLT now do.

Note that allowing R_X86_64_PC32 within shared objects (without -fPIC) requires a linker patch. This works within a kernel environment even if it should be disallowed elsewhere. But it reveals the larger question raised by the patch and its impact: whose responsibility should this behavior be?

It seems the linker/lld should supply an equivalent of -mcmodel=kernel, i.e. indicating 64-bit pointers will fit in a 32-bit address space. (Stated another way: it seems appropriate to allow users to 32-bit sign extend relocations in shared libraries if they specify some sort of kernel mode.)

From there though, is the linker the place to eliminate PLT relocations for this use case?

Or should the compiler be the one to specify the "right" relocations, meaning the D43383 patch should be modified to emit a different relocation for -mcmodel=kernel?

In summary:

1. Could you please clarify for me the conditions under which the PLT->PC relocation reduction should occur?
2. Given the goal of eliminating unneeded PLT entries from shared kernel objects: should the linker or the compiler be responsible for doing the right thing?

Thanks,

Justin

Hi Justin,

I can answer your first question:

> Could you please clarify for me the conditions under which the PLT->PC relocation reduction should occur?

Below is a description of the LLD linker's behavior, with links to the source code,
which also contains some useful comments.

1) The symbol must not be preemptible. The function responsible for setting
   this flag is `bool elf::computeIsPreemptible(const Symbol &sym)`
   (https://github.com/llvm/llvm-project/blob/master/lld/ELF/Symbols.cpp#L353)

   Linking with `--shared` is one of the reasons a symbol can become
   preemptible, but the `--Bsymbolic-functions`/`--Bsymbolic` options can
   be used to force it to be non-preemptible.

2) Then, when we know that a PLT entry will be resolved within the same ELF module, we
   can skip the PLT access and jump directly to the destination function.
   The code that performs the PLT->PC relaxation is the "expr = fromPlt(expr);"
   line in `scanReloc(...)`
   (https://github.com/llvm/llvm-project/blob/master/lld/ELF/Relocations.cpp#L1377)
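
The two steps above can be sketched as a toy decision model. This is a deliberately simplified Python illustration, not LLD's actual code: it only considers a default-visibility symbol defined in the module being linked, and it ignores visibility, version scripts, dynamic lists and every other input the real `computeIsPreemptible` looks at. The function names are hypothetical.

```python
def is_preemptible(shared, bsymbolic):
    """Toy rule for a default-visibility symbol defined in the linked module.

    Assumption: only the --shared and -Bsymbolic inputs are modelled.
    """
    return shared and not bsymbolic


def plt32_call_destination(shared, bsymbolic):
    """Where a call carrying R_X86_64_PLT32 ends up in this toy model."""
    # A preemptible symbol keeps its PLT entry; otherwise the call is
    # relaxed to a direct PC-relative jump to the definition.
    return "plt" if is_preemptible(shared, bsymbolic) else "direct"


for shared in (False, True):
    for bsymbolic in (False, True):
        print(f"shared={shared} bsymbolic={bsymbolic} ->",
              plt32_call_destination(shared, bsymbolic))
```

Under these assumptions only the combination of `--shared` without `-Bsymbolic` keeps the PLT entry.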
   
Hope that clarifies the situation a bit.

Best regards,
George | Developer | Access Softek, Inc

> Hi Justin,
>
> I can answer your first question

I can answer some other questions.

________________________________________
From: Justin Cady <desk@justincady.com>
Sent: November 2, 2020, 22:00
To: llvm-dev@lists.llvm.org
Cc: George Rimar
Subject: [EXTERNAL] [llvm-mc] FreeBSD kernel module performance impact when upgrading clang

> The relocation for the external uprintf call is changed from R_X86_64_PC32 to R_X86_64_PLT32.

The GNU as change is (binutils-gdb):

commit bd7ab16b4537788ad53521c45469a1bdae84ad4a
Author: H.J. Lu

     x86-64: Generate branch with PLT32 relocation

R_X86_64_PLT32 is a bit of a misnomer. The actual intention it conveys
is that this is a function call/jump operation and the address of the
target symbol is **insignificant**: which means the jump target can be
the actual function or a PLT entry.
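
To make this concrete: the x86-64 psABI computes R_X86_64_PC32 as S + A - P and R_X86_64_PLT32 as L + A - P, where S is the symbol address, L is the address of the symbol's PLT entry, A is the addend and P is the place being relocated. The Python sketch below evaluates both; all addresses are hypothetical and chosen only for illustration, and the helper name `rel32_field` is an assumption.

```python
def rel32_field(target, addend, place):
    """32-bit field value for a PC-relative relocation: target + A - P."""
    return (target + addend - place) & 0xFFFFFFFF


# Hypothetical addresses, chosen only for illustration.
CALL_OPCODE_ADDR = 0x1000      # address of the 0xe8 opcode byte
PLACE = CALL_OPCODE_ADDR + 1   # P: address of the 4-byte field being relocated
ADDEND = -4                    # A: the "-0x4" shown next to the symbol by objdump -r
FOO = 0x2000                   # S: address of the function itself
FOO_PLT = 0x3000               # L: address of a PLT entry for the function

# The call goes straight to the function (PC32, or PLT32 resolved directly).
print(hex(rel32_field(FOO, ADDEND, PLACE)))
# The call is routed through the PLT entry (PLT32 against a preemptible symbol).
print(hex(rel32_field(FOO_PLT, ADDEND, PLACE)))
```

Either value is a valid encoding of the same call instruction; the relocation type only tells the linker that it is free to pick the PLT entry as the target.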

For example, given a.s:

   .globl _start, foo
   _start:
     .byte 0xe8
     .reloc ., R_X86_64_PC32, foo - 4
     .long foo - .
   foo:
     ret

   $ as a.s -o a.o
   $ ld.lld -shared a.o # error: relocation R_X86_64_PC32 cannot be used against symbol foo; recompil

If you are using an older linker (either GNU ld or LLD), it may not have such a diagnostic.

> Note that allowing R_X86_64_PC32 within shared objects (without -fPIC) requires a linker patch. This works within a kernel environment even if it should be disallowed elsewhere. But it reveals the larger question raised by the patch and its impact: whose responsibility should this behavior be?

The user code's responsibility. See George's reply about computeIsPreemptible. There are many ways to make a defined symbol in -shared mode non-preemptible:

* visibility (usually via STV_HIDDEN; for a function, if you don't take the address in any -fno-pic translation unit, STV_PROTECTED can be used as well if you do want to export the function)
* -Bsymbolic, -Bsymbolic-functions
* --dynamic-list
* local: in --version-script

> It seems the linker/lld should supply an equivalent of -mcmodel=kernel, i.e. indicating 64-bit pointers will fit in a 32-bit address space. (Stated another way: it seems appropriate to allow users to 32-bit sign extend relocations in shared libraries if they specify some sort of kernel mode.)

There is no need for a new option. The desired characteristics can be achieved with existing mechanisms.

Thank you both for your replies, the links to the source, and the details about preemptible symbols.

> If you are using an older linker (either GNU ld or LLD), it may not have such a diagnostic.

Thank you for this concise example. The linker does in fact have this diagnostic, but it is being ignored.

So, for:

.globl _start, foo
_start:
   .byte 0xe8
   .reloc ., R_X86_64_PC32, foo - 4
   .long foo - .
foo:
   ret

$ as a.s -o a.o
$ ld.lld-10 -noinhibit-exec -shared -o a.so a.o # ld.lld-10: warning: relocation R_X86_64_PC32 cannot be used against symbol foo; recompile with -fPIC
$ objdump -d a.so

a.so: file format elf64-x86-64

Disassembly of section .text:

0000000000001298 <_start>:
    1298: e8 04 00 00 00 callq 12a1 <foo+0x4>

000000000000129d <foo>:
    129d: c3 retq

The call to foo does not go through the PLT. That's the behavior seen using clang5. But clang10 generates the PLT32 relocation instead, like this:

$ cat b.s
.globl _start, foo
_start:
   .byte 0xe8
   .reloc ., R_X86_64_PLT32, foo - 4
   .long foo - .
foo:
   ret

$ as b.s -o b.o
$ ld.lld-10 -shared -o b.so b.o
$ objdump -d b.so

b.so: file format elf64-x86-64

Disassembly of section .text:

00000000000012b0 <_start>:
    12b0: e8 1b 00 00 00 callq 12d0 <foo@plt>

00000000000012b5 <foo>:
    12b5: c3 retq

Disassembly of section .plt:

00000000000012c0 <foo@plt-0x10>:
    12c0: ff 35 d2 20 00 00 pushq 0x20d2(%rip) # 3398 <_DYNAMIC+0x10b8>
    12c6: ff 25 d4 20 00 00 jmpq *0x20d4(%rip) # 33a0 <_DYNAMIC+0x10c0>
    12cc: 0f 1f 40 00 nopl 0x0(%rax)

00000000000012d0 <foo@plt>:
    12d0: ff 25 d2 20 00 00 jmpq *0x20d2(%rip) # 33a8 <foo+0x20f3>
    12d6: 68 00 00 00 00 pushq $0x0
    12db: e9 e0 ff ff ff jmpq 12c0 <foo+0xb>

Thus the call to foo now goes through the PLT.

> There is no need for a new option. The desired characteristics can be achieved with existing mechanisms.

$ ld.lld-10 --Bsymbolic -shared -o b.so b.o
$ objdump -d b.so

b.so: file format elf64-x86-64

Disassembly of section .text:

0000000000001298 <_start>:
    1298: e8 00 00 00 00 callq 129d <foo>

000000000000129d <foo>:
    129d: c3 retq
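
The call targets in the three listings can be cross-checked by decoding the `e8` instruction by hand: the target is the address of the next instruction plus the signed 32-bit displacement. This is a small Python sketch using only the addresses and bytes shown above; the helper name `decode_call_target` is hypothetical.

```python
import struct


def decode_call_target(insn_addr, insn_bytes):
    """Decode an x86-64 `call rel32` (opcode 0xe8) and return its target address."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte call rel32 instruction")
    # The displacement is a little-endian signed 32-bit integer.
    (rel32,) = struct.unpack("<i", insn_bytes[1:])
    # It is relative to the address of the next instruction.
    return insn_addr + 5 + rel32


# (label, address of the call, instruction bytes) taken from the listings.
listings = [
    ("a.so (PC32, diagnostic ignored)", 0x1298, "e8 04 00 00 00"),
    ("b.so (PLT32)", 0x12B0, "e8 1b 00 00 00"),
    ("b.so (PLT32, --Bsymbolic)", 0x1298, "e8 00 00 00 00"),
]

for label, addr, hexbytes in listings:
    target = decode_call_target(addr, bytes.fromhex(hexbytes))
    print(f"{label}: call -> {target:#x}")
```

The decoded targets are 0x12a1 (which objdump labels foo+0x4), 0x12d0 (foo@plt) and 0x129d (foo itself), matching the disassembly.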

It appears you both are on to something. :)

I still have to experiment to see if this works in my environment. But given my desired outcome of the above clang5 scenario (eliminating the PLT entry) this appears promising. By ignoring the diagnostic when linking clang5-generated code, was --Bsymbolic behavior effectively already in use?

Thanks,

Justin

You used -noinhibit-exec to ignore the diagnostic, which is usually a
bad thing.

With an R_X86_64_PC32, you are right that after ignoring the diagnostic,
the instruction calls the foo definition.
With an R_X86_64_PLT32, because foo is preemptible, LLD will create a
PLT entry and jump to the PLT entry.

You can use -Bsymbolic to make foo non-preemptible and avoid the PLT entry.

> You used -noinhibit-exec to ignore the diagnostic, which is usually a bad thing.

I certainly agree with that.

The point I was trying to make in my original email is that, specifically for kernel objects, this diagnostic is incorrect. R_X86_64_PC32 can be used safely against the symbol foo in that specific context, and should be possible without ignoring diagnostics. I wondered if there should be a way for the user to share that notion with the linker (my earlier "-mcmodel=kernel for the linker" comments).

But I understand your point: there are already ways (listed in your previous reply) to share that notion with the linker. Perhaps not in the same way I was imagining, but with the same end result.

Again, I appreciate the replies and help understanding D43383. I'm going to continue to experiment given this information.

Thanks,

Justin

> The point I was trying to make in my original email is that, specifically for kernel objects, this diagnostic is incorrect. R_X86_64_PC32 can be used safely against the symbol foo in that specific context, and should be possible without ignoring diagnostics. I wondered if there should be a way for the user to share that notion with the linker (my earlier "-mcmodel=kernel for the linker" comments).

The linker flags were designed for userspace programs, and I've heard a
complaint from the Linux kernel community as well.
Again, the kernel does not need an additional linker flag. If ELF
interposition is not possible, the existing -Bsymbolic works.