LLD Linker Section Packing

Embedded systems often have heterogeneous memory regions available to place sections into, and GNU LD largely determines those placements syntactically. Decisions are made using the names, types, and filenames of the input sections, but not their sizes or the amount of space available in the regions. Accordingly, if too many sections are assigned to the same region, the link fails. This requires a degree of manual assignment in the linker script to ensure everything fits.

Manually laying out input sections can be fairly onerous in practice: when input files change size, a previously working linker script can suddenly fail to link due to memory region overflows. To make the process more automatic, various embedded toolchain vendors have extended, modified, or replaced the GNU LD linker script semantics. GNU LD itself has a flag to help with this as well, though it hasn’t made it into LLD.

I wanted to open a discussion about whether any of these alternatives could be implemented in LLD. I’ve provided a brief survey of options implemented by major embedded vendors. It’s necessarily incomplete, but it should aid discussion.

GNU LD --enable-non-contiguous-regions

When this flag is passed to GNU LD, if the first section definition that matches an input section would assign it to a memory region where it cannot fit, the match is aborted. In that case, the section remains unassigned until the next matching section definition, and the process continues from there. The flag --enable-non-contiguous-regions-warnings emits a diagnostic whenever the flag changes the allocation of a section (noisy).
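
For illustration, a minimal sketch with made-up region names and sizes: with the flag, an input .data.* section that overflows MEM1 falls through to the second definition and is placed in MEM2; without the flag, the link simply fails with a MEM1 overflow.

  MEMORY
  {
    MEM1 (rwx) : ORIGIN = 0x1000, LENGTH = 0x100
    MEM2 (rwx) : ORIGIN = 0x2000, LENGTH = 0x1000
  }

  SECTIONS
  {
    .data1 : { *(.data.*) } > MEM1  /* filled until MEM1 would overflow           */
    .data2 : { *(.data.*) } > MEM2  /* overflowing sections fall through to here  */
  }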

Linker-generated sections, and sections whose size changed due to relaxation, are not allowed to fall through in this fashion; if one of these cannot fit in its first matching section definition, the link fails.

Discussion

This approach requires special handling to use in practice. Since there is no way to cap the total size of input sections that can match an earlier section definition, sections will be assigned to it until the underlying region is filled. Accordingly, there is no general way to allocate anything after the automatically filled portion.

Because of this fill-until-full semantics, in order to accommodate the common pattern of filling a logical region with CODE, RODATA, DATA, and HEAP in varying amounts, the logical region would need to be subdivided into several memory regions, e.g., one for CODE, one for RODATA, etc. The end of these regions could be allocated dynamically using automatic section splitting; this would allow giving each region a manually-specified budget.
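
A rough sketch of that subdivision, with hypothetical region names, origins, and budgets; each per-kind region gets a manually chosen budget, and the fall-through section definitions would then target these regions in turn:

  MEMORY
  {
    /* One physical SRAM, manually carved into per-kind budgets. */
    SRAM_CODE   (rx) : ORIGIN = 0x20000000, LENGTH = 48K
    SRAM_RODATA (r)  : ORIGIN = 0x2000C000, LENGTH = 16K
    SRAM_DATA   (rw) : ORIGIN = 0x20010000, LENGTH = 32K
    SRAM_HEAP   (rw) : ORIGIN = 0x20018000, LENGTH = 32K
  }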

TI Linker

See Section 3, “Automatic Section Splitting”

The TI linker appears to be broadly GNU LD compatible, but it adds an extension to split input sections across multiple memory regions. The syntax is:

.text { ... } >> FIRST_REGION | SECOND_REGION | THIRD_REGION

Discussion

Although the syntax differs, the implications are broadly similar to those of the GNU LD flag. However, this acts as a separable extension to the linker script syntax, leaving the behavior of existing constructs alone, whereas the GNU LD flag modifies the behavior of regular section definitions.

MPLAB XC16

See Section 10.5, “Linker Allocation.”

Packing is done largely through orphan sections, which are packed into existing memory regions with a best-fit allocator; otherwise the linker is broadly GNU LD compatible. Custom section attributes affect this packing, but the details are complex and unlikely to generalize well.

Gaps left over by alignment during sequential allocation are transparently packed as well.

Discussion

This changes the behavior of orphan sections relative to GNU LD, rather than changing the behavior of general section definitions as the GNU LD flag does. Similar semantic concerns apply as with the approaches above.

ARM Scatter File

https://developer.arm.com/documentation/dui0474/m/scatter-loading-features/how-the-linker-resolves-multiple-matches-when-processing-scatter-files?lang=en

ARM scatter files have specificity rules, similar to CSS, for matching input sections to output sections. Generally, the most specific rule wins, independent of order.

https://developer.arm.com/documentation/dui0474/m/scatter-loading-features/placement-of-unassigned-sections-with-the--any-module-selector?lang=en

Unassigned sections are assigned to regions via .ANY selectors, using a selectable packing algorithm, e.g., first_fit, best_fit, etc. ANY_SIZE limits the maximum size an .ANY selector will accept, and each selector can be given a priority.

ARM scatter files differ from GNU LD-style linker scripts in that the ordering of sections within memory regions is not given by the order of the section specifiers; rather, the sections are sorted by type. The FIRST and LAST specifiers can be used to place a set of sections first or last within the type.

Discussion

The approach used by ARM scatter files could not be used directly in LLD, since linker scripts carry a general expectation of sequential assignment. Sequential assignment is used to directly control the ordering and addresses of sections, and any automatic placement mechanism must preserve this property.

Via the FIRST and LAST specifiers and the implicit sorting by type, this approach does allow placing contents after the variable region. Implementing this well would likely imply some kind of lazy addressing or backtracking: the addresses of the sections after a variable region cannot be known until the size of the variable region is known, but the size of the variable region cannot be determined until the amount of space available for allocation is determined, which in turn depends on the size of everything after the variable region. Alignment would also add complexity to the mix.

From Arm’s side, we would definitely like to see something like one of the above options implemented in LLD. I have a lot of experience with the armlink scatter-file options and the various trade-offs, but I don’t have any experience with the other extensions.

Personally I’d prefer not to apply rules globally via a command-line option, and instead to use additional syntax in the linker script. However, doing that means either introducing incompatibilities with GNU ld (still the most popular choice for open-source embedded projects) or finding a compromise with the binutils community on what the right syntax is. I would expect that most open-source projects will not want to be tied to an LLD extension.

I’d separate out the section-ordering part of scatter-loading. That is primarily there because the scatter-loading notation predates ELF: it is possible to write selectors that mix up RO, RW and ZI, and these need to be sorted into compatible types and flags. Linker scripts work directly with output sections of compatible type and flags, so they don’t have to do this.

“Most specific selector wins” is something that I think could be of use to LLD, although it would likely have to be put behind an option, or perhaps enabled by some additional notation in the input section description. A typical scatter file will match sections not by name but by attribute, like +ro-code (ELF flags AX), and rely on a more specific *(.specific_section_name) to override the general match.

The place where this would be most useful is with memory-mapped registers. A common idiom, for better or worse, is to declare a variable over the register and place it at the right address using the linker script. We need to make this .bss, as we don’t actually want anything placed at the address; the variable is just a convenient way to access the memory-mapped register.

__attribute__((section(".bss.my_memory_mapped_peripheral"))) type register_name;

Sadly the .bss prefix is needed to get clang or gcc to use SHT_NOBITS. The intention is to place these with a linker script:

  my_memory_mapped_peripheral 0x<address> : { *(.bss.my_memory_mapped_peripheral) }

Linker scripts tend to have a pattern like:

.bss : { *(.bss .bss.*) }

Here, the .bss.* pattern tends to match these sections before they can be placed at their intended address. The only workaround I know of is to not use the .bss prefix and instead put NOLOAD on the output section; however, LLD gives a warning when the section isn’t .bss, and this can cause problems for projects that link with fatal warnings.
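
For reference, a minimal sketch of that workaround, assuming the C declaration uses a hypothetical section name without the .bss prefix (e.g., section(".my_memory_mapped_peripheral")):

  /* Input section is PROGBITS since it lacks the .bss prefix; NOLOAD keeps it
     out of the load image, at the cost of the LLD warning mentioned above. */
  my_memory_mapped_peripheral 0x<address> (NOLOAD) : { *(.my_memory_mapped_peripheral) }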

The two downsides of most specific selector wins are:

  • Defining what “most specific” means. In most cases, preferring a non-wildcard match over a wildcard match is all that is needed.
  • Performance. With a naive implementation, every section needs to be checked against every selector rather than stopping at the first match. In most cases this won’t matter, but there could be corner cases where a more efficient algorithm is needed.

I can imagine something like .ANY working in linker scripts with the addition of a size limit. For example:

  .text.1 : { * (.ANY(*.text.*), 0x1000) }
  ...
  .text.2 : { * (.ANY(*.text.*), 0x2000) }

Making this up on the fly: following armlink’s rules, all non-.ANY selection is done first, leaving only orphan sections as candidates for .ANY. Depending on the algorithm used for .ANY (I think it is a bin-packing problem), .text.1 could be filled until 0x1000 bytes were used, then .text.2 until 0x2000 bytes were used. In scatter-loading the linker uses the equivalent of the maximum size of the output section, although there is no way of actually setting that in a linker script, which is why I think a lot of the alternative schemes work off MEMORY regions. Personally I would prefer to set the max size on the selector pattern.

My experience with .ANY is that very few people change the default algorithm or priority, although it is very useful to a handful of projects. Using the maximum size of the output section can cause problems, as at section-selection time you don’t know what the effect of alignment padding will be, or how many range-extension thunks you’ll need, so it helps to have some contingency.

I would be happy to help with the design, implementation, or review of any of these features. I guess the key thing is coming up with something that is ideally acceptable to GNU binutils and whose additional complexity the LLD maintainers are happy to accept.

I’m very interested in this as well, but I want to point out that the different memory regions often have performance implications. I think it’s the linker’s job to know about these trade-offs.

We’re starting to use the iMX RT for CircuitPython, which is a Python VM on bare metal. The iMX RT parts have a Cortex-M7 core clocked between 400 and 1000 MHz. There are basically three tiers of memory:

  1. Tightly coupled memory (TCM) runs at the core speed.
  2. OCRAM runs at 1/4 core speed or less but is accessible to more peripherals.
  3. Execute-in-place (XIP) flash. Much slower than RAM but much larger (SPI flash running at 120 MHz and four bits per clock). A user-modifiable file system also lives on the flash, and programming it temporarily prevents XIP access.

Tiers 2 and 3 may also be accessed through a cache that runs at core speed. The caches are of limited size, between 8 and 32k. TCM sizes are configurable between 32k and 256k depending on the part.

Function calls between tiers 1 and 3 require “veneer” or “trampoline” code to bridge the wide address-space gap.

In CircuitPython, the VM portions are always used by user code, but peripheral-specific functionality may not be. Currently, we use linker scripts with binutils to place code in the different regions. We want VM code in TCM because it is always used and needs to be fast (the error-handling portions could be in XIP). Code that runs while flash is disabled needs to always be in RAM (maybe TCM). All of the “user code may use this” code should live on flash and hopefully :crossed_fingers: stay in cache once loaded.
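
To illustrate, a rough sketch of the kind of manual placement this involves; the region names, origins, lengths, and file patterns below are illustrative rather than the actual CircuitPython script:

  MEMORY
  {
    ITCM  (rx)  : ORIGIN = 0x00000000, LENGTH = 64K   /* core-speed code             */
    OCRAM (rwx) : ORIGIN = 0x20200000, LENGTH = 512K  /* slower, peripheral-visible  */
    FLASH (rx)  : ORIGIN = 0x60000000, LENGTH = 8M    /* XIP; also holds the FS      */
  }

  SECTIONS
  {
    /* Hot, always-used VM code is pulled into TCM by explicit patterns,
       intended to be copied out of flash by startup code. */
    .text.itcm : { *(.itcm*) *vm*.o(.text*) } > ITCM AT> FLASH

    /* Everything else executes in place from flash and relies on the cache. */
    .text : { *(.text*) } > FLASH
  }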

So, ideally the linker would place all explicitly assigned sections into TCM first (likely due to hazards) and then fill the rest of the space with the hottest code first. That’d allow us to have one linker script apply across varying TCM sizes.

Note that there is a separate issue with using sections to designate functions that need to be in RAM, because the linker will happily link those to functions and/or data in the region of memory that is supposedly inaccessible.