RFC: LLD range extension thunks

I'm about to start working on range extension thunks in lld. This is
an attempt to summarize the approach I'd like to take and what the
impact will be on lld outside of thunks. I'm interested if anyone has
any constraints the approach will break, alternative suggestions, or
is working on something I'll need to take account of?

I expect range extension thunks to be important for ARM and useful for
AArch64. In principle any target with a limited range branch immediate
instruction could benefit from them.

I've put some detail about range extension thunks at the end of this
message for those not familiar with them. The name range extension
thunks is by no means universal, for example in ARM's ABI
documentation they are called veneers, The GNU linker calls them
stubs.

Summary of existing thunk implementation (ARM interworking and Mips
PIC to non-PIC calls):
- A Regular, Shared or Undefined symbol may have a single thunk
- For each relocation to a symbol S, if we need a thunk we use the
thunk for S as the target of the relocation rather than S. The thunk
will transfer control to S.
- Thunks are assigned to an InputSection, these are written out when
the InputSection is written. The InputSection with the Thunk contains
either the caller (ARM) or callee (Mips).
- For all existing thunks, the decision of whether a thunk is needed
is not dependent on address. A Thumb branch to ARM will always need a
thunk, no matter the distance. Thunks can therefore be generated
relatively early in Writer::run().

High level impact of range extension thunks:

There may be more than one than one thunk per callee:
- A range extension thunk must be placed within range of the caller,
there may be cases where no single thunk for a callee is in range of
all callers.
- An ARM Target may need a different Thunk for ARM and Thumb callers.

Address information is needed to determine if a range extension thunk is needed:
- The more precise the address information available the less thunks
will be generated. the most precise address information is the final
address of caller and callee is known at thunk creation time, the
least precise is neither the address of the caller or callee is known.

Range extension thunks can be combined or replace other thunks
- Thunks may also be used for instruction set interworking (ARM) or
for calling between position independent and non-position independent
code (Mips). Either a chain of thunks or a combined thunk that does
both operations is needed. For ARM all range extension thunks can
trivially be interworking thunks as well.

Range extension thunk placement can be important
- Many callers may need a range extension. Placing a range extension
thunk so that it is in range of the most callers minimizes number of
thunks needed.
- Thunks may be better as synthetic sections rather than as additions
to input sections.

Adding/removing content must not break the range calculations used in
range extension thunks.
- If any caller, callee or thunk address is changed after range
extension thunks are calculated it could invalidate the range
calculation.
- Ideally range extension thunks are the last operation the linker
does prior to resolving relocations.

I think that there are two separate areas to a range extension thunk
implementation that can be considered separately.
1.) Moving thunk generation to a later stage, at a minimum we need an
estimation of the address of each caller and callee, in an ideal world
we know the final address of each caller and callee. This could mean
assigning section addresses multiple times.
2.) The alterations to the core data structures to permit more than
one Thunk per symbol and the logic to select the "right" Thunk for
each relocation.

The design I'd like to aim at moves thunk creation into
finalizeSections() at a point where the sizes and addresses of all the
SyntheticSections are known. This would mean that the final address of
each caller and callee could be used, and after thunk creation there
would be no further content changes. This would mean:
- All code that runs prior to thunk creation may have the offset in
the OutputSection altered by the addition of thunks. In particular
scanRelocs() calculates the offset in the OutputSection of some
relocations. We would need to find alternative ways of handling these
cases so that they could either survive thunk creation or be patched
up afterwards.
- assignAddresses() will need to run at least twice if thunks are
created. At least once to give the thunk creation the caller and
callee addresses, and at least once after all thunks have been
created.

There is an alternative design that only uses estimates of caller and
callee address to decide if a thunk is needed. In effect we use a
heuristic to predict how much extra synthetic content, such as plt and
got size, will be added after Thunk creation when deciding if a Thunk
is needed. I'm not in favour of this approach as from bitter
experience it tends to result in hard to debug problems when the
heuristics break down. Precise addresses would also allow errata
patching thunks [*]

I've not thought too hard about how to alter the core data structures
yet. I think this will mostly be implementation detail though.

Next steps:
I'd like to proceed with the following plan:
1.) Move the existing thunk implementation to where it would need to
be in finalizeSections(). This should flush out all the non-thunk
related assumptions about addresses without adding any existing
complexity to the Thunk implementation.
2.) Add support for multiple thunks per symbol
3.) If it turns out to be a good idea, implement thunks as SyntheticSections
4.) Add support for range extensions.

I think the first implementation of range extension thunks should be
simple and not try too hard to minimize the number of thunks needed.
If there is a need to optimize it can be done later as the changes
should be within the thunk creation module.

Thanks for reading

Peter

The remainder of the message is a brief explanation of range extension
and errata patching thunks.

What are range extension thunks?
Many architectures have branch instructions that have a finite range
that is insufficient to reach all possible program locations. For
example the ARM branch immediate instruction has an immediate that
encodes an offset of +-32Mb from the branch instruction. A range
extension thunk is a linker generated code sequence, inserted between
the caller and the callee, that completes the transfer of control to
the callee when the distance between the caller and callee exceeds the
range of the branch instruction. A simple example in psuedo assembly
for a non-position independent ARM function call.

source:
BL long_range_thunk_to_target
...
long_range_thunk_to_target
LDR r12, target_address ; r12 is the corruptible interprocedural
scratch register (ip)
BX r12
target_address:
.word target ;
...
target:
...

What is an errata patching thunk?
Some CPU errata (hardware bugs) can be fixed at a link time by
replacing an instruction with a branch to a sequence of equivalent
instructions that are guaranteed to to not trigger the erratum. In
some cases the trigger sequence is dependent on precise addresses such
as immediates crossing page boundaries, for example
Marcus Shawcroft - [binutils-gdb] [AArch64] Workaround for Cortex A53 erratum 843419 . Errata
patching is out of the scope of implementing range extension thunks
but can be seen as a generalization of it.

Hi Peter,

Here are my comments:

  • I didn’t think hard enough, but I believe creating thunks as synthetic sections instead of attached data for other input sections is towards a right direction, because synthetic sections are suitable for adding linker-generated data to output files.

  • As you wrote, we need to iterate relocations at least twice to create range extension thunks. Each iteration can be a linear scan, correct? I mean, we can start from the section at the lowest address towards higher address examining relocations and create thunks if targets are too far.

  • I do not see a reason that we need to associate range extension thunks to symbols. It seems to me that while scanning relocations, we need to keep only the last thunk address for each symbol. If we find that some relocation against symbol S needs a range extension thunk, we first check if the last thunk for S is within the range and reuse it if it is. In this way, we need to keep only one thunk for one symbol at any moment.

  • Have you considered rewriting relocations? I think, if we find that relocation R pointing to symbol S needs a range extension thunk, we should (1) create a range extension thunk, (2) create a symbol body object S’ for the thunk, (3) and rewrite R to point to S’ instead of S. Then later passes don’t have to deal with thunks.

Hello Rui,

Thanks for the comments

- Synthetic sections and rewriting relocations
I think that this would definitely be worth trying. It should remove
the need for thunks to be represented in the core data structures, and
would allow .

It would also mean that we wouldn't have to associate symbols with
thunks as the relocations would directly target the thunks. ARM
interworking makes reusing thunks more difficult as not every thunk is
compatible with every caller. For example:
ARM B target and Thumb2 B.W target can't reuse the same thunk even if
in range as the branch instruction can't change state.

I think it is worth an experiment to make the existing implementation
of thunks use synthetic sections and rewriting relocations before
trying to implement range extension thunks.

- Yes the scan is linear it is essentially:
do
    assign addresses to input sections
    for each relocation
        if (thunk needed)
            create thunk or reuse existing one
while (no more thunks added)

There's quite a lot of complexity that can be added with respect to
the placement of thunks within the output section. For example if
there is a caller with a low address and a caller with a high address,
both might be able to reuse a thunk placed in the middle. I think it
is worth starting simple though.

Peter

Hello Rui,

Thanks for the comments

- Synthetic sections and rewriting relocations
I think that this would definitely be worth trying. It should remove
the need for thunks to be represented in the core data structures, and
would allow .

Creating symbols for thunks would have another benefit: it makes
disassembled output easier to read because thunks have names.

It would also mean that we wouldn't have to associate symbols with
thunks as the relocations would directly target the thunks. ARM
interworking makes reusing thunks more difficult as not every thunk is
compatible with every caller. For example:
ARM B target and Thumb2 B.W target can't reuse the same thunk even if
in range as the branch instruction can't change state.

I think it is worth an experiment to make the existing implementation
of thunks use synthetic sections and rewriting relocations before
trying to implement range extension thunks.

- Yes the scan is linear it is essentially:
do
    assign addresses to input sections
    for each relocation
        if (thunk needed)
            create thunk or reuse existing one
while (no more thunks added)

There's quite a lot of complexity that can be added with respect to
the placement of thunks within the output section. For example if
there is a caller with a low address and a caller with a high address,
both might be able to reuse a thunk placed in the middle. I think it
is worth starting simple though.

I agree. I believe that computing the best thunk positions is NP-hard, but
the best layout and a layout produced by a naive algorithm wouldn't be that
different.

Correct conclusion, but there's no way the problem is NP.

Reordering functions (or even individual basic blocks) to minimize the
needed thunks is a complex problem.

But you're not doing that. Once an ordering is selected a simple greedy
algorithm is optimal.

There is no cost difference between a thunk that is right next to the short
jump and a thunk that is only juuust within range. So you can find the
lowest address jump needing a thunk to a particular target and put the
thunk the maximum possible distance after it (after the end of a function,
or even after any unconditional branch). Find everything else within range
of that thunk and fix it up. Repeat.

Other algorithms will give smaller average displacements to the thunks, but
there is no advantage in that. No other algorithm will generate fewer
thunks.

That's assuming all short branches have the same code size and displacement
distance.

If there are multiple branch distances and code sizes (and you have a
choice between them at given call sites) then it's still just a simple
dynamic programming problem, solvable in linear [1] time by trying each
branch size at the first available call site, with a cache of the minimum
cost assuming the first 0, 1, 2 .. N call sites have already been covered.

[1] or at least nCallSites * nBranchSizes

Hello Rui,

Thanks for the comments

- Synthetic sections and rewriting relocations
I think that this would definitely be worth trying. It should remove
the need for thunks to be represented in the core data structures, and
would allow .

Creating symbols for thunks would have another benefit: it makes
disassembled output easier to read because thunks have names.

It would also mean that we wouldn't have to associate symbols with
thunks as the relocations would directly target the thunks. ARM
interworking makes reusing thunks more difficult as not every thunk is
compatible with every caller. For example:
ARM B target and Thumb2 B.W target can't reuse the same thunk even if
in range as the branch instruction can't change state.

I think it is worth an experiment to make the existing implementation
of thunks use synthetic sections and rewriting relocations before
trying to implement range extension thunks.

- Yes the scan is linear it is essentially:
do
    assign addresses to input sections
    for each relocation
        if (thunk needed)
            create thunk or reuse existing one
while (no more thunks added)

There's quite a lot of complexity that can be added with respect to
the placement of thunks within the output section. For example if
there is a caller with a low address and a caller with a high address,
both might be able to reuse a thunk placed in the middle. I think it
is worth starting simple though.

I agree. I believe that computing the best thunk positions is NP-hard,
but the best layout and a layout produced by a naive algorithm wouldn't be
that different.

Correct conclusion, but there's no way the problem is NP.

Reordering functions (or even individual basic blocks) to minimize the
needed thunks is a complex problem.

But you're not doing that. Once an ordering is selected a simple greedy
algorithm is optimal.

There is no cost difference between a thunk that is right next to the
short jump and a thunk that is only juuust within range. So you can find
the lowest address jump needing a thunk to a particular target and put the
thunk the maximum possible distance after it (after the end of a function,
or even after any unconditional branch). Find everything else within range
of that thunk and fix it up. Repeat.

I don't think this analysis is correct. Assume a 1M branch displacement for
short jumps. Consider:

secA: 512K (contains a jump "jumpA" at offset 0 that needs a thunk (it
doesn't matter where it needs to jump to, just that it definitely needs a
thunk))
secB: 512K
secC: 512K (contains a jump "jumpC" at offset 0 that jumps to offset 0 in
secA (i.e., just barely in range of a short jump))

If the thunk for jumpA is placed between secB and secC (as it would be
based on your description) it will push the branch at the beginning of secC
out of range, forcing another thunk to be needed. In this small example,
the thunk for jumpA must be placed before secA in order to avoid needing a
thunk for jumpC. In other words, placing thunks can cause you to need even
more thunks.

-- Sean Silva

Sure, that’s true. I was considering placing thunks from a lot of different origins to the same target. Your example is for different targets. Definitely that’s a much harder problem.

With 1M offsets this is not going to happen often enough to worry about and perfect optimality is probably not worth trying for.

It’s a different matter if you only have +/-128B offsets! Then it will be extremely common. But at the same time there will be 8000 times fewer possible placements of each thunk, and the range of influence of each decision will be very small, making a dynamic programming approach very fast at finding the optimal solution. A bigger problem is the basic block might be larger than the offset range, in which case you need to modify the BB to put the thunk inline (not actually a thunk any more).

After looking at this for a while, I do not think that this problem is NP-hard. With a finite “short branch” displacement k, I was not able to come up with a gadget that could create global constraints as would be needed to e.g. model an instance of 3SAT or vertex cover in terms of this problem.

The problem is hard though. I believe that it is likely to be exponential in the “short branch” displacement k, and k is typically “pretty big”.

– Sean Silva

Sure, that's true. I was considering placing thunks from a lot of
different origins to the same target. Your example is for different
targets. Definitely that's a much harder problem.

With 1M offsets this is not going to happen often enough to worry about
and perfect optimality is probably not worth trying for.

Yes, optimality isn't worth trying for. Peak performance builds will
hopefully someday in LLD have functions reordered so that hot edges
typically won't need thunks. A lightweight heuristic for minimizing extra
TLB footprint due to thunks might be worthwhile though.

It's a different matter if you only have +/-128B offsets! Then it will be
extremely common. But at the same time there will be 8000 times fewer
possible placements of each thunk, and the range of influence of each
decision will be very small, making a dynamic programming approach very
fast at finding the optimal solution.

I don't see how one would use dynamic programming for finding an optimal
solution to this problem because there doesn't seem to be optimal
substructure. In other words, for dynamic programming to be applicable, you
must be able to make a decision using local information and know that it is
optimal and that you will not needed to reconsider it. It's not clear to me
how a smaller branch displacement inherently allows deciding that a choice
is optimal based solely on local information.

-- Sean Silva

Isn’t it the same problem as what ARMConstantIslandPass is trying to address?

I think so. It uses essentially the same iterative approach being discussed in this thread.

Sorry for being late on the thread, but I just wanted to say that I
agree with the design. The problem is very similar to relaxation in MC
and should probably have a similar solution:

* Keep all offsets relative to input/synthetic sections (fragments in
  MC).
* Compute addresses.
* If anything is not in range add a thunk (relax in MC).
* Repeat.

Cheers,
Rafael

Peter Smith <peter.smith@linaro.org> writes:

Now that LLD works well for FreeBSD/amd64 (and arm64 is very close)
I'm looking at other architectures, starting with mips64. The
statically-linked toolchain components currently fail to link with an
out of range jump, so I'm very interested in seeing this work
progress. Are you looking at only arm and AArch64? Once the
infrastructure is in I'll try to take a look at mips if nobody else
does first.

Now that LLD works well for FreeBSD/amd64 (and arm64 is very close)
I'm looking at other architectures, starting with mips64. The
statically-linked toolchain components currently fail to link with an
out of range jump, so I'm very interested in seeing this work
progress. Are you looking at only arm and AArch64? Once the
infrastructure is in I'll try to take a look at mips if nobody else
does first.

I'm waiting for this changes too. Now mips thunks places at the end of the
corresponding section. Not sure about FreeBSD but on Linux that leads to
incorrect code in case of static linking -- a thunk goes between crt*.o
files which needs to be "joined" together. Gnu linker puts thunks to the
separate section. We need to do the same thing.

Out of curiosity, would this be done in a object format agnostic way? Windows ARM requires the same branch island support, and the logic for the detection and placement could be shared across ELF and COFF I suspect.

Out of curiosity, would this be done in a object format agnostic way?
Windows ARM requires the same branch island support, and the logic for the
detection and placement could be shared across ELF and COFF I suspect.

I think that most of the code changes needed for this are going to be just
refactoring to "put a loop around the Writer" and so is inherently specific
to ELF and COFF (since they have different Writer's). Initial thunk
placement heuristics/algorithms etc. are likely to be pretty simple (i.e.
too small lines of code to really share meaningfully). If we start doing
something really complicated for placing thunks it may make sense to factor
out so it can be shared, but it may end up like ICF where the algorithms we
use are the same, but the details of the data structures and semantics
differ enough that it's not clear if sharing them would be a net savings or
not.

-- Sean Silva

I think anything that we come up with will need to support Mips as
well as ARM and AArch64. I'm aware of the Mips LA25 Thunk and the need
for these Thunks to be placed before the Target Section. I'm intending
that the first patch that uses synthetic sections for Thunks will do
this. I'm not intending to go looking for new Mips Thunk types to
implement, I think it would be better that someone with more
experience of Mips and an ability to test properly, implemented these
[*]. I'm similarly ignorant of Power, which has limited range branch
instructions.

My plan for implementation so far is to convert the existing
implementation to use synthetic sections without trying to do range
extension. When that is in and working we can build upon that to
include range extension. If we can make the range detection and thunk
placement sufficiently flexible it shouldn't be difficult to add new
Thunks later.

[*] I can see that the pseudo-direct addressing used by J and JAL
instructions could in principle have range extension thunks and would
need different handling to ARM/AArch64 branches with PC-relative
offsets. I'm not sure whether Mips toolchains typically do use Thunks
in this case or if it is the caller's responsibility to use an
indirect jump via JR instead.

I agree with Sean, most of the work is plumbing within the ELF linker.
I expect that the approach could be translated to the COFF linker but
the majority of the implementation details would be different.

Are you looking only at fixed length ISAs and adding thunks, or also at ISAs with different size branch instructions with different ranges, and relaxing short branches to either longer branches (if that will reach) or to thunks (if not)?

Note that it’s not purely “thunks for RISC” and “longer instructions for CISC”. There are ISAs that could use a mixture of techniques.

For example, Thumb2 has 16 bit branch instructions with a range of -252 - +256, and 32 bit branch instructions with a range of +/-16MB. If neither of those will work then you use a short or long branch to a thunk.

Another example is RISC-V, which has:

  • 16 bit conditional branch instructions with +/-256 bytes range
  • 16 bit unconditional branch(&link) instructions with +/-2KB range
  • 32 bit conditional branch instructions with +/-2KB range
  • 32 bit unconditional branch(&link) instructions with +/-1MB range
  • 32 bit unconditional branch(&link) to reg +/-2KB offset that can be paired with a LUI or AUIPC (which load or add a 20 bit immediate shifted left by 12) to make a two instruction sequence that can do an absolute 32 bit branch or relative branch to PC +/ 2 GB.

So thunks are never needed, but code size can change. If you’re optimizing for size then it could be an advantage to use a thunk anyway, if it can be shared.

The current clang & gcc versions always generate the full two instruction sequence, and then the linker deletes the LUI or AUIPC if the literal is zero. This does not produce as optimal code as starting with short branches and expanding the ones that won’t reach.

Both ARM and RISC-V ABIs have defined scratch registers that the linker (or other) can freely clobber for calculations inside thunks, or when converting instructions to multi-instruction sequences.

The implementation will have to handle different branch ranges within
the same architecture, on ARM alone there are at least 3 ranges that
we'd have to handle within the same link [*].

At this stage I'd like to keep it simple and just handle the case of
redirecting a relocation to a target T to a thunk that will ultimately
transfer control to T.

I do have an interest in relaxation, primarily as a way to implement
errata patching, overwrite instruction with a branch to a patch thunk,
then return to a following instruction. However I think that is
something that can be built later as it is a generalisation of what we
would need for range extension thunks.

[*] A later revision of the ARM ELF document clarified that the linker
should not add Thunks for 16-bit Thumb branches, it is the
responsibility of the code-generator/author to only use these
instructions when the range to the target can be guaranteed. See Call
and Jump relocations on Page 33