A Propeller link (similar to a Thin Link as used by ThinLTO)?

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

* There is no general disassembly work. General
disassembly work would assuredly frighten off developers. (Inherently
unreliable, memory usage heavy and difficult to deal with CFI, debug
information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it clear) is that
Propeller intends to **reorder basic block sections across translation units**.
This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (**Can we fix the bottleneck of full LTO** [1]?)
for large executables and I believe some other users are in the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:

## Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in
the future to overalign some sections and pad gaps with NOPs. What else can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel's Jump Condition Code Erratum.

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the linker.

Route 2 does heavy lifting work in the compiler, which can naturally reuse the assembler level mitigation,
CFI and debug information generating, and probably other stuff.
(How will debug information be bloated?)

## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.

Regular ThinLTO with minimized bitcode files:

  all: compile thin_link thinlto_backend final_link
  
  compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
    $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
    $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c
  
  thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
    $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o
  
  thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
    $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
    $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
  
  final_link exe: lto/a.o lto/b.o a.rsp
    # Propeller does basic block section reordering here.
    $(clang) -fuse-ld=lld @a.rsp -o exe

We need to replace the two stages thinlto_backend and final_link with
three.

Propelled ThinLTO with minimized bitcode files:
  
  propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
    # Propeller emits something similar to a Machine IR file.
    # a.o and b.o are all IR files.
    $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
    $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir
  
  propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
    # Propeller collects input Machine IR files,
    # spawn threads to generate object files parallelly.
    $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
  
  final_link exe: propeller/a.o propeller/b.o
    # GNU ld/gold/lld links object files.
    $(clang) $^ -o exe

A .mir may be much large than an object file. So lto/a.mir may be
actually an object file annotated with some information, or some lower
level representation than a Machine IR (there should be a guarantee that
the produced object file will keep the basic block structure unchanged
=> otherwise basic block profiling information will not be too useful).

[1]: **Can we fix the bottleneck of full LTO** [1]?

I wonder whether we have reached a "local maximum" of ThinLTO.
If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework?
Apparently, if full LTO did not have the scalability problem, we would
not do so much work in the linker?

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

  • There is no general disassembly work. General
    disassembly work would assuredly frighten off developers. (Inherently
    unreliable, memory usage heavy and difficult to deal with CFI, debug
    information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it clear) is that
Propeller intends to reorder basic block sections across translation units.
This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (Can we fix the bottleneck of full LTO [1]?)
for large executables and I believe some other users are in the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:

Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in
the future to overalign some sections and pad gaps with NOPs. What else can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel’s Jump Condition Code Erratum.
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the linker.

I’m not sure how the basic-block sections feature makes it hard to implement a mitigation for that specific JCC erratum. I may be missing something, but doesn’t the BB sections actually make it easier to implement, as the JCC occurs only at the ending of each basic block, and with the BB sections we know what the ending instruction is for each block? I mean, when we are reordering sections, and if we found some BB with JCC is not at a desired address, we can add a padding before that BB.

Hi Fangrui,

Not sure why you started a new conversation when you could have just replied to the existing thread.

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

  • There is no general disassembly work. General
    disassembly work would assuredly frighten off developers. (Inherently
    unreliable, memory usage heavy and difficult to deal with CFI, debug
    information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it clear) is that
Propeller intends to reorder basic block sections across translation units.

This was the intention all along with basic block sections from the very beginning.

This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (Can we fix the bottleneck of full LTO [1]?)
for large executables and I believe some other users are in the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:

Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in
the future to overalign some sections and pad gaps with NOPs. What else can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel’s Jump Condition Code Erratum.
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the linker.

Route 2 does heavy lifting work in the compiler, which can naturally reuse the assembler level mitigation,
CFI and debug information generating, and probably other stuff.
(How will debug information be bloated?)

Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.

Regular ThinLTO with minimized bitcode files:

all: compile thin_link thinlto_backend final_link

compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
$(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
$(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c

thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
$(clang) -fuse-ld=lld -Wl,–thinlto-index-only=a.rsp -Wl,–thinlto-prefix-replace=‘;lto’ -Wl,–thinlto-object-suffix-replace=‘.indexing.o;.o’ a.indexing.o b.indexing.o

thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
$(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
$(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o

final_link exe: lto/a.o lto/b.o a.rsp

Propeller does basic block section reordering here.

$(clang) -fuse-ld=lld @a.rsp -o exe

We need to replace the two stages thinlto_backend and final_link with
three.

I am not sure I fully follow what you mean here but it seems to be along the lines of going back to MIR to do the optimizations. We are considering this and we have even discussed this with Eli in the original thread:

http://lists.llvm.org/pipermail/llvm-dev/2019-September/135455.html

For example, we are looking at inserting prefetch instructions at specific points in the binary. We would not be disassembling native code to do that but would be doing it in MIR.

Propelled ThinLTO with minimized bitcode files:

propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc

Propeller emits something similar to a Machine IR file.

a.o and b.o are all IR files.

$(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
$(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir

propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir

Propeller collects input Machine IR files,

spawn threads to generate object files parallelly.

$(clang) -fpropeller-backend -fpropeller-prefix-replace=‘lto;propeller’ lto/a.mir lto/b.mir

final_link exe: propeller/a.o propeller/b.o

GNU ld/gold/lld links object files.

$(clang) $^ -o exe

A .mir may be much large than an object file. So lto/a.mir may be
actually an object file annotated with some information, or some lower
level representation than a Machine IR (there should be a guarantee that
the produced object file will keep the basic block structure unchanged
=> otherwise basic block profiling information will not be too useful).

[1]: Can we fix the bottleneck of full LTO [1]?

I wonder whether we have reached a “local maximum” of ThinLTO.
If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework?
Apparently, if full LTO did not have the scalability problem, we would
not do so much work in the linker?

Full LTO has very high overheads for medium to large binaries. As a data point, I ran a Full LTO optimization of a binary with 350M of text and I had to kill the process after RSS went to 175G. I couldn’t get it to run on my beefy machine with 192G of RAM.

Hope this helps address some of your concerns.

Thanks
Sri

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

* There is no general disassembly work. General
disassembly work would assuredly frighten off developers. (Inherently
unreliable, memory usage heavy and difficult to deal with CFI, debug
information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it
clear) is that
Propeller intends to **reorder basic block sections across translation
units**.
This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (**Can we fix the
bottleneck of full LTO** [1]?)
for large executables and I believe some other users are in the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably
require
help from the linker/compiler. It seems we have two routes:

## Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend
it in
the future to overalign some sections and pad gaps with NOPs. What else
can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel's Jump Condition Code
Erratum.

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in
the linker.

I'm not sure how the basic-block sections feature makes it hard to
implement a mitigation for that specific JCC erratum. I may be missing
something, but doesn't the BB sections actually make it easier to
implement, as the JCC occurs only at the ending of each basic block, and
with the BB sections we know what the ending instruction is for each block?
I mean, when we are reordering sections, and if we found some BB with JCC
is not at a desired address, we can add a padding before that BB.

Loss of MachineInstr/MCInst (what we have are ELF object files) and
refraining from disassembly makes it hard to implement the Intel JCC
Erratum mitigation.

Inserting padding can increase the distance of JCC_1/JMP_1 instructions.
JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.

jb 1f; nop; 1: # 72 01 jb 0x3
jb 2f; .space 0x7f, 0x90; 2: # 72 7f jb 0x84
jb 3f; .space 0x80, 0x90; 3: # 0f 82 80 00 00 00 jb 0x10a

Without disassembly, we can only add NOPs, but not the superior segment
override prefixes. Note that x86 has several instructions which are
documented (Table 24-3. Format of Interruptibility State") as enabling
interrupts exactly one instruction after the one which changes the SS
segment register. Inserting a nop allows an interrupt to arrive before
the execution of the following instruction which changes semantic
behaviour.

   # NOP before jmp can change semantic behavior.
   sti; jmp baz
   movl %esi, %ss; jmp baz
   movw (%rsi), %ss; jmp baz

Well, we may go the MSP430 route: always generate the maximum range branch
instructions, and rely on the linker to relax instructions. This is also
what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous
message:) ) There are many challenges here.

Coming back to the topic. We really have 2 routes.

a) Start from ELF object files, add sufficient metadata that supports post-link optimization
b) Start from machine level representations (MIR-like), remove unneeded details to make whole-program post-link optimization affordable.

Our current route is a). To avoid creating JCC_1/JMP_1 which are close
to the jump distance limit, we may have to implement something similar
to TargetInstrInfo::getInstSizeInBytes for x86, add BranchRelaxation
to addPreEmitPass(), so that JCC_1/JMP_1 can be avoided in the first place.
Unfortunately, getInstSizeInBytes is not great when there is inline
assembly, because a pseudo instruction or a macro can expand to multiple
real instructions. In this regard, LLVM is better than GCC because GCC
just counts the statements
(Size of an asm (Using the GNU Compiler Collection (GCC)))

Debug information is another hassle. We only have relocation records,
and we will use conservative forms to make debug information not break
after Propeller optimization. If we start from DwarfDebug, we will be
more confident that nothing will break.

I understand that out focus has always been b), and it is (huge) sunk
cost if we change the direction to a). I am just concerned how many new
things we will discover in the future which is mandatory to annotate ELF
object files.

Starting from b) is a challenge, because it will push us to rethink
various LLVM infrastructures. The nice thing in return for Propeller is
immediate reuse of facility already provided by the compiler. The nice
thing in a long term is flexible frameworks that will benefit the
overall LLVM project and other future optimizations.

Additionaly, Sri told me that we would also do compiler-inserted
prefetching (this topic has been thoroughly studied by prior art, so I
don't think exposing the information is sensitive at all). I cannot
imagine how we would do it without more machine level information.

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

  • There is no general disassembly work. General
    disassembly work would assuredly frighten off developers. (Inherently
    unreliable, memory usage heavy and difficult to deal with CFI, debug
    information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it
clear) is that
Propeller intends to reorder basic block sections across translation
units
.
This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (Can we fix the
bottleneck of full LTO
[1]?)
for large executables and I believe some other users are in the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably
require
help from the linker/compiler. It seems we have two routes:

Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend
it in
the future to overalign some sections and pad gaps with NOPs. What else
can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel’s Jump Condition Code
Erratum.

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in
the linker.

I’m not sure how the basic-block sections feature makes it hard to
implement a mitigation for that specific JCC erratum. I may be missing
something, but doesn’t the BB sections actually make it easier to
implement, as the JCC occurs only at the ending of each basic block, and
with the BB sections we know what the ending instruction is for each block?
I mean, when we are reordering sections, and if we found some BB with JCC
is not at a desired address, we can add a padding before that BB.

Loss of MachineInstr/MCInst (what we have are ELF object files) and
refraining from disassembly makes it hard to implement the Intel JCC
Erratum mitigation.

Inserting padding can increase the distance of JCC_1/JMP_1 instructions.
JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.

jb 1f; nop; 1: # 72 01 jb 0x3
jb 2f; .space 0x7f, 0x90; 2: # 72 7f jb 0x84
jb 3f; .space 0x80, 0x90; 3: # 0f 82 80 00 00 00 jb 0x10a

Without disassembly, we can only add NOPs, but not the superior segment
override prefixes. Note that x86 has several instructions which are
documented (Table 24-3. Format of Interruptibility State") as enabling
interrupts exactly one instruction after the one which changes the SS
segment register. Inserting a nop allows an interrupt to arrive before
the execution of the following instruction which changes semantic
behaviour.

NOP before jmp can change semantic behavior.

sti; jmp baz
movl %esi, %ss; jmp baz
movw (%rsi), %ss; jmp baz

Well, we may go the MSP430 route: always generate the maximum range branch
instructions, and rely on the linker to relax instructions. This is also
what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous
message:) ) There are many challenges here.

Coming back to the topic. We really have 2 routes.

a) Start from ELF object files, add sufficient metadata that supports post-link optimization
b) Start from machine level representations (MIR-like), remove unneeded details to make whole-program post-link optimization affordable.

Our current route is a). To avoid creating JCC_1/JMP_1 which are close
to the jump distance limit, we may have to implement something similar
to TargetInstrInfo::getInstSizeInBytes for x86, add BranchRelaxation
to addPreEmitPass(), so that JCC_1/JMP_1 can be avoided in the first place.
Unfortunately, getInstSizeInBytes is not great when there is inline
assembly, because a pseudo instruction or a macro can expand to multiple
real instructions. In this regard, LLVM is better than GCC because GCC
just counts the statements
(https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm)

Debug information is another hassle. We only have relocation records,
and we will use conservative forms to make debug information not break
after Propeller optimization. If we start from DwarfDebug, we will be
more confident that nothing will break.

FWIW on this point in particular (Debug info + Propeller), I’ve worked with Sri & others to fairly well (to the best of my ability) validate the DWARF produced by propeller. There are some things that still need generalizing & improving (essentially, the ability to detect when a label difference is known at compile-time or not & then using that to select slightly different DWARF output - for now the Propeller prototype hardcodes a few of these cases rather than using such a general mechanism) but in general I’m satisfied DWARF can handle this in a way that doesn’t require special handling by the linker. There are some opportunities (applicable without Propeller, but perhaps a bit more significant with Propeller) to do some selectively content-aware changes in the linker (eg: range list collapsing - doesn’t require reading all the DWARF, just the range lists which are much smaller/simpler).

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured:

  • There is no general disassembly work. General
    disassembly work would assuredly frighten off developers. (Inherently
    unreliable, memory usage heavy and difficult to deal with CFI, debug
    information, etc)

Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section

. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V where sections can
shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
my mind)) (binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78). With just minimal
amount of disassembly work, conceptually the framework should not be too
hard to be ported to another target.

One thing I was not aware of (perhaps the description did not make it clear) is that
Propeller intends to reorder basic block sections across translation units.
This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford doing a full LTO (Can we fix the bottleneck of full LTO [1]?)
for large executables and I believe some other users are in the same camp.

Right, beyond distributed build system, even on a single machine and for “small” projects like clang: building on a laptop with FullLTO can be challenging in terms of memory consumption, and the iterative development is just not practical.

Now, with ThinLTO, the post link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:

Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in
the future to overalign some sections and pad gaps with NOPs. What else can we
do? Source code/IR/MCInst is lost at this stage. Without general assembly
work, it may be difficult to do more optimization.

This makes me concerned of another thing: Intel’s Jump Condition Code Erratum.
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the linker.

Route 2 does heavy lifting work in the compiler, which can naturally reuse the assembler level mitigation,
CFI and debug information generating, and probably other stuff.
(How will debug information be bloated?)

Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.

Regular ThinLTO with minimized bitcode files:

all: compile thin_link thinlto_backend final_link

compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
$(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
$(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c

thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
$(clang) -fuse-ld=lld -Wl,–thinlto-index-only=a.rsp -Wl,–thinlto-prefix-replace=‘;lto’ -Wl,–thinlto-object-suffix-replace=‘.indexing.o;.o’ a.indexing.o b.indexing.o

thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
$(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
$(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o

final_link exe: lto/a.o lto/b.o a.rsp

Propeller does basic block section reordering here.

$(clang) -fuse-ld=lld @a.rsp -o exe

We need to replace the two stages thinlto_backend and final_link with
three.

Propelled ThinLTO with minimized bitcode files:

propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc

Propeller emits something similar to a Machine IR file.

a.o and b.o are all IR files.

$(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
$(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir

propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir

Propeller collects input Machine IR files,

spawn threads to generate object files parallelly.

$(clang) -fpropeller-backend -fpropeller-prefix-replace=‘lto;propeller’ lto/a.mir lto/b.mir

final_link exe: propeller/a.o propeller/b.o

GNU ld/gold/lld links object files.

$(clang) $^ -o exe

There was an interesting talk last week at the LLVM performance workshop: Global Machine Outliner for ThinLTO which introduced a similar stage in ThinLTO (for another purpose though). I believe they avoid the serialization of MIR by running the CodeGen twice instead (once to collect the cross-module informations, and the second time using these informations).
CC the author in case the slides are already available online.

A .mir may be much large than an object file. So lto/a.mir may be
actually an object file annotated with some information, or some lower
level representation than a Machine IR (there should be a guarantee that
the produced object file will keep the basic block structure unchanged
=> otherwise basic block profiling information will not be too useful).

[1]: Can we fix the bottleneck of full LTO [1]?

I wonder whether we have reached a “local maximum” of ThinLTO.
If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework?
Apparently, if full LTO did not have the scalability problem, we would
not do so much work in the linker?

At lot of work went into ThinLTO because the scalability issue of LTO was considered inherent to the design. It isn’t clear what you’re suggesting here though?

Hereby, we discuss our plan for handling Intel’s JCC mitigation as follows.

TLDR; By computing basic block groupings early, the compiler can form larger clusters of basic blocks (each cluster in a section) which will allow Propeller to just reuse the assembler’s mitigation. Our experiments show that when JCC mitigation causes only 0.2% slowdown for Propeller, compared to the 0.6% slowdown incurred for the vanilla configuration.

A slightly longer summary:

  • We evaluated a Propeller prototype to reuse the existing assembler mitigation in llvm, -mbranches-within-32B-boundary, which currently uses only NOPs for mitigation.

  • With some changes, Propeller is able to reuse the existing assembler mitigation. To do this, we form large basic block clusters (sections containing multiple basic blocks) in the compiler by computing the basic block layout earlier.

  • Vanilla clang benchmark (no Propeller) regresses by ~0.6% with this flag.

  • With Propeller, the exact same flag regresses clang only by ~0.2%, reducing the total speedup from 7.8% to 7.6%.

  • For similar problems, the solution is most optimally implemented in the linker. However, for this particular problem, it appears that the assembler’s mitigation is good enough when combined with Propeller.

Background

The JCC erratum is a CPU bug affecting Skylake processors which results in unpredictable behaviour under complex micro-architectural states involving the Decoded I-cache, specifically, when executing branches which cross a cache line.

MicroCode Update (MCU) Mitigation

The CPU avoids this bug by bypassing the Decoded ICache for branches crossing 32B boundaries. This sacrifices some performance (0-4%) in return for correctness. The compiler can alleviate this effect by aligning the code such that branches do not cross a 32B boundary. There are two ways that the compiler can do this:

  1. Inserting NOP instructions

  2. Inserting prefixes for instructions

The current solution shipped with clang-10 (under -mbranches-within-32B-boundary) aligns every function at 32B and uses NOPs between instructions. Our experiment shows enabling this option results in 0.6% performance degradation for Clang. There have been some efforts to improve this using instruction prefixes (https://reviews.llvm.org/D72225, https://reviews.llvm.org/D75268) even though there has been some uncertainty about the available headroom (https://reviews.llvm.org/D72225#1818149).

JCC Mitigation in Propeller

Propeller modifies the code layout by emitting basic blocks into sections and reordering them at link time. This means the assembler’s mitigation could be corrupted by Propeller.

There are two ways in which Propeller can solve the problem:

  1. Redo the full mitigation in the linker

  2. Reuse the mitigation that is being implemented in the assembler

Next we discuss each of the two strategies in more detail.

Full Mitigation in the Linker

The current compiler solution is implemented in the assembler backend and its scope is limited to one function at a time (with -function-sections), which requires excessive alignment of 32B for the function entry.

As a post-link optimization infrastructure, Propeller has the global view of all sections in the link time and is at a better position for global optimal JCC mitigation. The challenge for Propeller is finding the location of affected branch instructions, and inserting paddings or prefixes at the right places (some instructions cannot be prepended with prefixes or NOPs). This is easier for the assembler as it has higher-level information about instructions and can use the MC layer structures (such as MCRelaxableFragment) to emit variable-sized paddings or prefixes.

As we discuss next, our prototype relying on the assembler’s mitigation incurs no significant overhead and therefore we do not plan to address this problem in the linker.

Relying on the Assembler’s Mitigation

Propeller can use the assembler’s mitigation on every basic block section. However, this means every basic block would be aligned at 32 bytes. The paddings between the basic blocks may be executed nops which will put significant pressure on the CPU’s frontend.

To reduce the NOP paddings, we would need to emit BB sections at a coarser level of granularity, which would mean emitting multiple basic blocks in the same section. However, currently, Propeller delays the basic block layout computation until link time and hence the actual group of basic blocks (cluster) is only available at link time.

To make this work, we implemented a prototype by moving the layout computation before the final round of Propeller compilation. After the layout is computed, basic block partitions of each function are extracted and passed to the compiler.

For example, consider the following BB layout for a program consisting of two functions foo (with 5 basic blocks) and bar (with a single basic block).

foo

foo.BB.1

foo.BB.2

bar

foo.BB.3

foo.BB.4

The extracted BB partitions are as follows:

foo: { [foo, foo.BB.1, foo.BB.2] , [foo.BB.3, foo.BB.4] }

Bar: { [bar] }

We instruct the compiler to emit foo’s basic blocks in two sections and bar’s single basic block in one section. The assembler applies JCC mitigation on each of the three sections by aligning them at 32 bytes and inserting minimal paddings between instructions within every section. The only change compared to the baseline mitigation with -function-sections is emitting an excessive 32 bytes alignment for foo.BB.3. However, the introduced padding is non-executed code (may have small pressure on the instruction cache and TLB).

We note that the layout algorithm would scatter a function’s basic blocks across multiple partitions judiciously and only if it is advantageous for the performance. For intra-procedural layout, only two clusters are created (hot and cold). Nonetheless, the non-executed paddings for clusters will have minimal impact on performance.

On another note, better code layout could reduce the overhead of JCC mitigation because the hot code would be packed together and the paddings for the cold blocks will not affect the hot code.

Results

We evaluated Clang’s performance under different optimizations with and without JCC mitigation. We used PGO + ThinLTO for all configurations. We tested two propeller code layouts: inter-procedural, and intra-procedural. The intra-procedural results in at most two clusters for every function, while the inter-procedural layout could lead to more.

To use JCC mitigation, we use “-Wl,-mllvm,–x86-branches-within-32B-boundaries -mbranches-within-32B-boundaries".

We ran the clang bootstrap test 10 times for each configuration and measured the average cpu time (user + sys in seconds).

We note that our evaluation is performed on a machine without the microcode update installed.

Mitigation Enabled | Mitigation Disabled |

  • | - | - |
    baseline (PGO + ThinLTO) | 545.362 | 542.012 |
    Propeller intra-proedural | 506.828 | 504.861 |
    Propeller inter-procedural | 503.23 | 502.136 |

Clang’s cpu time relative to the baseline, for different optimization flavors, with and without JCC mitigation

FIrst, JCC Mitigation results in a 0.6% slowdown when applied to the baseline. With Propeller, JCC mitigation incurs 0.4% slowdown for intra-procedural and 0.2% for inter-procedural. The lesser JCC mitigation slowdowns for Propeller configurations shows the impact of better code layout. When hot and cold code are mixed together, the paddings in the cold part could put more pressure on I-Cache and I-TLB.

Conclusion

Using BB clusters, we can reuse the assembler’s JCC mitigation with no significant impact on performance. In fact the slowdown caused by JCC mitigation is lower for Propeller, because of the better code layout.

Finally, we would like to stress once again that Propeller has the potential to do a better job for problems like this JCC mitigation. However, for this particular problem, we have shown that the assembler’s mitigation is good enough to be used along with Propeller.