[RFC] .symtab_meta - a .symtab extension to communicate symbol metadata from the compiler to the linker

snidertm · October 3, 2024, 3:31pm

Summary

I discuss the limitations in the current ELF object file format for representing metadata information that applies to a given symbol. Encoding symbol metadata into a .symtab_meta section is presented as an alternate method/mechanism that is more powerful and flexible for the task at hand. The .symtab_meta section is currently used in the implementation of the location, noinit and persistent attributes as well as several other attributes in various target compiler toolchains supported by the Texas Instruments (TI) compiler tools group. The existing TI implementation is ELF-specific but could easily be adapted to work within other object formats as well.

I then summarize a proposal to upstream support for the .symtab_meta extension to the LLVM source base.

Motivation

In an ELF object file, the information associated with a given symbol is represented in a standard ELF symbol table entry. An Elf32_Sym record currently stores information about a symbol’s

name (st_name) - an index into the string table
value (st_value) - its actual integer value (if symbol is absolute) or the address where the symbol is defined (if symbol is weak or global)
size (st_size) - size in bytes based on the type of the symbol
type and binding (st_info) - binding in most significant 4-bits and type in least significant 4-bits, where type refers to a data object, function, section, or source file.
8 extra bits (st_other) - ELF specification indicates that this field is normally filled with 0 and has no defined meaning
section (st_shndx) - index into the section header table pointing to the section where the symbol is defined
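The st_info packing described in the list above corresponds to the ELF32_ST_BIND, ELF32_ST_TYPE and ELF32_ST_INFO macros in the ELF specification. A minimal sketch of that packing follows; the function names st_bind, st_type and st_info are illustrative only and are not taken from any header.

```c
#include <stdint.h>

/* Binding and type constants as defined by the ELF specification. */
enum { STB_LOCAL = 0, STB_GLOBAL = 1, STB_WEAK = 2 };
enum { STT_NOTYPE = 0, STT_OBJECT = 1, STT_FUNC = 2, STT_SECTION = 3, STT_FILE = 4 };

/* Binding lives in the most significant 4 bits of st_info. */
static inline unsigned st_bind(uint8_t info) { return info >> 4; }

/* Type lives in the least significant 4 bits of st_info. */
static inline unsigned st_type(uint8_t info) { return info & 0xf; }

/* Combine a binding and a type into a single st_info byte. */
static inline uint8_t st_info(unsigned bind, unsigned type) {
    return (uint8_t)((bind << 4) | (type & 0xf));
}
```

With only 4 bits each for binding and type, there is no room left in this field for attribute values such as an address.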

Compiler-generated DWARF debug information provides additional information about a given symbol. For example, the DWARF information entry for the symbol will be annotated with DW_AT_location and DW_AT_type attributes that describe the runtime location of the symbol’s definition and the symbol’s data type, respectively.

There is additional information that can be associated with a symbol to assist the linker with the proper handling of a symbol. For example, the clang compiler supports the retain attribute that instructs the linker to keep the definition of a symbol in a link even if it isn’t referenced elsewhere in the application. In the compiler-generated object file, this attribute is propagated to the section in which the symbol is defined. It is encoded as the SHF_GNU_RETAIN section flag in the sh_flags field of an ELF section header record (Elf32_Shdr or Elf64_Shdr).

The sh_flags field of an ELF section header is to be interpreted as a bitmask, which means that any general and/or processor-specific semantic information relevant to a given section is limited to the number of bits in the sh_flags field.

A further limitation of representing semantic information about sections or symbols in a bitmask is that a single bit in the sh_flags field cannot carry the value associated with a symbol attribute (an address, for example).

An Alternative Representation of Symbol Metadata Information

There is a more expressive and flexible way of representing symbol metadata information in an ELF object file.

Consider an embedded application that defines a data object that must reside at a specific location at run time. The TI compiler toolchains support a location attribute that can be specified as follows:

__attribute__((location( 0x12345678 ))) int my_located_var = 10 ;

The TI compiler will generate a symbol metadata record into a special section named .symtab_meta that contains:

a symbol table index pointing to the symbol that this meta-data applies to
the kind of symbol meta-data; an integer representation instead of a bitmask
the value associated with the meta-data; for a location kind of meta-data, the value would be an address

There are ways to get around some of the above-mentioned limitations. For example, the Arm Ltd. compiler supports an at attribute that is identical to the above location attribute in intent:

__attribute__((at( 0x12345678 ))) int my_located_var = 10 ;

Instead of encoding the value of the at attribute in a flag or an extension to the symbol table, the Arm Ltd compiler puts the definition of the variable into a section whose name is annotated with the specified address argument.

However, there are additional benefits to the proposed .symtab_meta approach to representing extra semantic information about symbols. Not only does representing the metadata kind as an integer vastly increase the number of different kinds of symbol metadata information that can be represented in an object file, but also:

It is not limited in the number of different kinds of symbol metadata information that can be applied to the same symbol
It is not limited in the kinds of values that can be associated with a given piece of symbol metadata information; a value field could be:
- an integer value (as in the above example)
- a string table index (e.g. a string encoding of format specifiers associated with printf-like function calls)
- a symbol table index (e.g. indicating a symbol-to-symbol alias mapping)

Encoding symbol metadata as an extension to the symbol table enables many capabilities that are particularly useful for embedded applications, such as:

Communication of placement-related information from compiler to linker
- profile-based placement or explicit user-directed placement - expressing the preferred memory type in which to place a symbol (e.g. TCM, on-chip, off-chip, etc)
- specific placement - Arm Ltd’s at attribute / TI’s location attribute
Communication of special initialization semantics
- TI’s noinit and persistent attributes
Link-time function specialization
- Boot routine
- Run-time initialization
- memset, memcpy specialization
- printf specialization

The majority of the above capabilities have already been implemented and are used today in the TI compiler toolchains.

Proposal

I propose to upstream support for this symbol metadata extension to the symbol table to the LLVM source base. This entails:

Providing a mechanism to opt-in/opt-out of including this support in a given toolchain
Encoding symbol attributes and semantic information that are not otherwise represented into compiler-generated object files, specifically in a .symtab_meta section consisting of an array of fixed-length symbol metadata records
Adding support for symbol metadata assembly directives that encode symbol metadata information into an object file that is generated from the assembler
Adding support for generating symbol metadata assembly directives when compiling to assembly
Adding support to edit the .symtab_meta section in conjunction with edits to the symbol table in llvm-objcopy
Reading, processing, applying the contents of a .symtab_meta section in the lld linker

A specification of the proposed .symtab_meta section and fixed-length symbol metadata records follows.

.symtab_meta Section and Symbol Metadata Records

.symtab_meta Section

The section table entry for the .symtab_meta section will contain values that are particularly relevant to the .symtab_meta section:

sh_name = index into string table pointing to “.symtab_meta” string
sh_type = SHT_SYMTAB_META (a new section header type)
sh_addr = 0 (.symtab_meta section is not loaded into target memory)
sh_link = index of .symtab section in the section header table
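As a rough sketch of how a producer might fill in this header, assuming a local mirror of the 32-bit ELF section header layout: the numeric value of SHT_SYMTAB_META, and the sh_size, sh_entsize and sh_addralign settings, are not given in the proposal and are placeholders here.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the 32-bit ELF section header layout. */
typedef struct {
    uint32_t sh_name, sh_type, sh_flags, sh_addr, sh_offset;
    uint32_t sh_size, sh_link, sh_info, sh_addralign, sh_entsize;
} Shdr32;

/* HYPOTHETICAL value: the proposal does not assign a number to this type.
   This placeholder sits in the OS-specific range (SHT_LOOS..SHT_HIOS). */
#define SHT_SYMTAB_META 0x60000001u

/* Build a header for a .symtab_meta section holding nrecords records. */
static Shdr32 make_symtab_meta_shdr(uint32_t name_off, uint32_t symtab_index,
                                    uint32_t file_off, uint32_t nrecords) {
    Shdr32 sh;
    memset(&sh, 0, sizeof sh);    /* sh_flags stays 0: section is not allocated */
    sh.sh_name = name_off;        /* offset of ".symtab_meta" in the string table */
    sh.sh_type = SHT_SYMTAB_META;
    sh.sh_addr = 0;               /* not loaded into target memory */
    sh.sh_link = symtab_index;    /* section header index of .symtab */
    sh.sh_offset = file_off;
    sh.sh_entsize = 8u;           /* assumption: two 32-bit words per record */
    sh.sh_size = nrecords * 8u;   /* assumption: records are stored back to back */
    sh.sh_addralign = 4u;         /* assumption: word alignment */
    return sh;
}
```

Linking the section to .symtab through sh_link mirrors how relocation sections name the symbol table they index into.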

Symbol Metadata Records

Elf32_SymMeta

typedef struct {
    Elf32_Word sm_info;
    Elf32_Word sm_value;
} Elf32_SymMeta;

where:

Index of symbol associated with metadata: index = ((Elf32_Word)sm_info >> 8);
Symbol metadata kind: kind = (sm_info & 0xff);
A sub-range of kind identifiers will be reserved for processor-/toolchain-specific use
Interpretation of sm_value field depends on kind
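The index and kind expressions above can be wrapped in small helpers. The following is a sketch only: the SymMeta32 type name, the helper names, and the kind number used for a location record are all hypothetical, since the proposal does not assign numeric kind values.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sm_info;   /* symbol index in the upper 24 bits, kind in the lower 8 */
    uint32_t sm_value;  /* interpretation depends on the kind */
} SymMeta32;

/* HYPOTHETICAL kind number, for illustration only. */
enum { SMK_LOCATION_EXAMPLE = 1 };

/* Extract the symbol table index from sm_info. */
static inline uint32_t symmeta32_index(uint32_t info) { return info >> 8; }

/* Extract the metadata kind from sm_info. */
static inline uint32_t symmeta32_kind(uint32_t info) { return info & 0xff; }

/* Build a record; the index must fit in 24 bits and the kind in 8 bits. */
static inline SymMeta32 symmeta32_make(uint32_t sym_index, uint32_t kind,
                                       uint32_t value) {
    assert(sym_index < (1u << 24) && kind <= 0xff);
    SymMeta32 r = { (sym_index << 8) | kind, value };
    return r;
}
```

For the my_located_var example, a record for symbol index 42 would be built as symmeta32_make(42, SMK_LOCATION_EXAMPLE, 0x12345678). One consequence of this layout is that a 32-bit record can only refer to symbol indices below 2^24.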

Elf64_SymMeta

typedef struct {
    Elf64_Xword sm_info;
    Elf64_Xword sm_value;
} Elf64_SymMeta;

where:

Index of symbol associated with metadata: index = ((Elf64_Xword)sm_info >> 16);
Symbol metadata kind: kind = (sm_info & 0xffff);
A sub-range of kind identifiers will be reserved for processor-/toolchain-specific use
Interpretation of sm_value field depends on kind
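A matching sketch for the 64-bit record, again with hypothetical type and helper names. The shift amount and the mask have to agree: with a 16-bit kind in the low bits, the symbol index starts at bit 16 and occupies the remaining 48 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t sm_info;   /* symbol index in the upper 48 bits, kind in the lower 16 */
    uint64_t sm_value;  /* interpretation depends on the kind */
} SymMeta64;

/* Extract the symbol table index from sm_info. */
static inline uint64_t symmeta64_index(uint64_t info) { return info >> 16; }

/* Extract the metadata kind from sm_info. */
static inline uint64_t symmeta64_kind(uint64_t info) { return info & 0xffff; }

/* Pack an index and a kind; the index must fit in 48 bits, the kind in 16. */
static inline uint64_t symmeta64_info(uint64_t sym_index, uint64_t kind) {
    assert(sym_index < (UINT64_C(1) << 48) && kind <= 0xffff);
    return (sym_index << 16) | kind;
}
```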

History and Rationale

I submitted an RFC for an earlier and less sophisticated method of encoding symbol metadata directly into a symbol table entry in April of 2019 (RFC - a proposal to support additional symbol metadata in ELF object files in the ARM compiler - Project Infrastructure / LLVM Dev List Archives - LLVM Discussion Forums). Feedback from that RFC thread was incorporated into what has evolved into this RFC.

An earlier version of this .symtab_meta specification was proposed for inclusion in the upstream GCC source base, but did not get adequate support for approval at the time. The specification has since been refined and the usage of .symtab_meta in TI compiler tools has expanded into areas such as support for user-directed placement attributes in source code.

Extending the symbol table with the .symtab_meta section is a piece of infrastructure that has significant value, especially for embedded toolchains and applications.

tannewt · October 7, 2024, 5:46pm

I really like the extensibility of this metadata approach.

Where would you document the various keys that are supported and their behavior? I’d love more info for our embedded llvm use.

jyknight · October 7, 2024, 11:12pm

Is it actually useful to define this as a “generic” symbol-metadata table? With only 8 or 16-bits of “kind”, and a fixed-size value that’s either 32-bit or 64-bit, it doesn’t really seem like it’d be a generically-useful amount of extensibility? Could the other existing various “symbol metadata” sections we have in LLVM even use this format, if it existed at the time they were introduced?

Separately from that overall question – probably this new table should be using a relocation to point to the symbol? Because in some previous discussions on the “llvm_addrsig” section, there was a fair amount of grumbling that we did not use relocations to specify the symbols. That means any ELF-processing tools which add/remove symbols from the symbol-table will corrupt the addrsig section. Using relocations was undesirable at that time, because of the space overhead – but making the relocation overhead small was one of the goals of the recently-introduced CREL relocation format.

snidertm · October 8, 2024, 10:40pm

The intent is to provide a common format for representing symbol metadata. Some toolchain features depend on the capability to communicate information from the compiler to the linker. If there are such features that someone in the LLVM community is willing to make available to any toolchain that might find it valuable, then they’d only have to use the ld.lld linker or update their downstream linker to comprehend the .symtab_meta section to get the feature.

The proposed format also provides for a downstream toolchain to implement a feature that they don’t want to upstream. Having a generic symbol metadata representation will at least reduce the amount of toolchain-specific code that they have to develop, and it is also likely to reduce the number of merge conflicts encountered as they maintain their implementation.

A “kind” of 8 or 16 bits seems reasonably generous, providing up to 256 or 65,536 different symbol metadata kinds, respectively. I don’t have a set-in-stone idea about how many of these kinds would be processor- or toolchain-specific.

Re: fixed-length symbol metadata, an earlier version of the symbol metadata record used ULEB128 encodings for the kind, symbol index, and value fields. This would help reduce the size of the symbol metadata section, but I worried that the LLVM and GNU communities would be less amenable to a variable-length record. Maybe I was mistaken?
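For readers unfamiliar with ULEB128, here is a minimal sketch of the standard encoding (7 payload bits per byte, high bit set on every byte except the last). This is the generic algorithm as used in DWARF, not the TI implementation, and the function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode value as ULEB128 into out (up to 10 bytes); returns bytes written. */
static size_t uleb128_encode(uint64_t value, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;   /* take the low 7 bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;              /* more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Decode one ULEB128 value from in; returns bytes consumed.
   Assumes the input is well formed and fits in 64 bits. */
static size_t uleb128_decode(const uint8_t *in, uint64_t *value) {
    uint64_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;
    do {
        byte = in[n++];
        result |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
    } while ((byte & 0x80) && shift < 64);
    *value = result;
    return n;
}
```

Under this encoding, a kind and a symbol index below 128 take one byte each, and a value such as 0x12345678 takes five bytes, so that record would occupy 7 bytes compared with 8 for the fixed 32-bit record and 16 for the 64-bit one.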

I haven’t seen or I am not aware of other existing symbol metadata sections that are flexible enough to support different types of metadata, some of which would require one or more additional values to be interpreted based on the kind.

I’d appreciate if you can direct me to any of these that are already available in the upstream LLVM sources.

I admit this kind of information feels like relocation types, but I think the proposal allows for more flexibility.

For example, suppose you want to support an attribute to express that a given function or variable is to be placed in a particular user-named memory region. Memory regions are defined in a linker script and are arbitrarily named by an application developer. The attribute information could be carried to the linker in the object file (assuming that IR is not available in the file that is input to the linker), the value field becoming a string table index to the named memory region. The linker could verify that the specified memory region is valid (defined in the linker script).
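A sketch of the link-time check described above, under the assumption that the value field is a byte offset into an ELF string table and that the linker has the region names from the linker script available as a simple array. Both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return the NUL-terminated string at offset off in a string table, or
   NULL if the offset is out of range or the string is unterminated. */
static const char *strtab_get(const char *strtab, size_t size, size_t off) {
    if (off >= size)
        return NULL;
    if (memchr(strtab + off, '\0', size - off) == NULL)
        return NULL;
    return strtab + off;
}

/* Check that the region named by a metadata value field is one of the
   memory regions defined in the linker script. */
static bool region_is_defined(const char *strtab, size_t size, size_t value,
                              const char *const *regions, size_t nregions) {
    const char *name = strtab_get(strtab, size, value);
    if (name == NULL)
        return false;
    for (size_t i = 0; i < nregions; i++)
        if (strcmp(name, regions[i]) == 0)
            return true;
    return false;
}
```

A linker could diagnose an undefined region name at this point instead of silently falling back to default placement.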

I agree that you do incur some maintenance cost with things like llvm-objcopy where a user could edit the symbol table. Making one of the fields of a symbol metadata record a symbol table index does mean that you would have to keep the symbol indices in the symbol metadata records in sync with edits to the symbol table.
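To illustrate the bookkeeping involved, here is a sketch of how a tool might rewrite 32-bit records after a symbol table edit. It assumes the tool already has an old-to-new index map; the SYM_REMOVED marker and the function name are hypothetical, and this is not how llvm-objcopy is implemented.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t sm_info; uint32_t sm_value; } SymMeta32;

/* HYPOTHETICAL marker in the map for a symbol that was deleted. */
#define SYM_REMOVED UINT32_MAX

/* Rewrite the symbol index of each record using old_to_new[], dropping
   records whose symbol no longer exists. Returns the number of records kept. */
static size_t symmeta32_remap(SymMeta32 *recs, size_t n,
                              const uint32_t *old_to_new, size_t nsyms) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t old_index = recs[i].sm_info >> 8;
        uint32_t kind = recs[i].sm_info & 0xff;
        if (old_index >= nsyms || old_to_new[old_index] == SYM_REMOVED)
            continue;                               /* symbol was removed */
        recs[kept].sm_info = (old_to_new[old_index] << 8) | kind;
        recs[kept].sm_value = recs[i].sm_value;
        kept++;
    }
    return kept;
}
```

For kinds whose value field is itself a symbol table index, the value would need the same remapping, which means the tool has to understand each kind it rewrites.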

I’m not familiar enough with CREL relocation format to understand if or how it would help with this issue. Does CREL also require an encoding of the symbol index?

~ Todd

snidertm · October 9, 2024, 4:09pm

Some of the kinds (or “keys”) that we support in the TI toolchains include:

SMK_NOINIT - identify a variable placed at a specific address in non-volatile memory as not to be initialized at load time or reset
SMK_PERSISTENT - identify a variable placed at a specific address in non-volatile memory as initialized at load time, but not to be re-initialized at reset
SMK_LOCATION - instruct the linker to place a section containing the definition of a function or variable at a specific address
SMK_RETAIN - instruct the linker to retain a section in which a symbol is defined (this has been obsoleted and replaced by an upstream implementation of a section header flag SHF_GNU_RETAIN)
SMK_OPTPRINTF_FORMATSTR - tell the linker what format specifiers are used in calls to printf-like functions to facilitate specialization of RTS printf support function
SMK_OF_PLACEMENT - instruct the linker to place a function or variable in a type of memory (e.g. local or TCM, onchip, offchip)
SMK_PRE_POST_PAD - instruct the linker to insert padding before and after a function section

“SMK” refers to “symbol metadata kind”

A given symbol metadata kind will be associated with a linker feature. A given feature may need more than one symbol metadata kind to support it.

Some of these would be considered processor- or toolchain-specific kinds. Features that are implemented upstream, intended for general use, should be associated with symbol metadata kinds that are NOT processor- or toolchain-specific.

A processor-/toolchain-specific range of symbol metadata kind IDs will be set aside within the range of available kinds (a la ELF section header flags). My inclination is to reserve half of the available IDs for processor-/toolchain-specific symbol metadata kinds.
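One way the half-and-half split might look, as a sketch only. The boundary names and values below are hypothetical, and which half is reserved is an assumption; nothing here has been decided.

```c
#include <stdbool.h>
#include <stdint.h>

/* HYPOTHETICAL boundaries: lower half generic, upper half reserved for
   processor-/toolchain-specific kinds. */
enum {
    SMK32_LOPROC = 0x80,   SMK32_HIPROC = 0xff,    /* 8-bit kind (Elf32)  */
    SMK64_LOPROC = 0x8000, SMK64_HIPROC = 0xffff   /* 16-bit kind (Elf64) */
};

/* True if an 8-bit kind falls in the processor-/toolchain-specific half. */
static inline bool smk32_is_specific(uint32_t kind) {
    return kind >= SMK32_LOPROC && kind <= SMK32_HIPROC;
}

/* True if a 16-bit kind falls in the processor-/toolchain-specific half. */
static inline bool smk64_is_specific(uint64_t kind) {
    return kind >= SMK64_LOPROC && kind <= SMK64_HIPROC;
}
```

With this split, the 8-bit kind field leaves 128 generic and 128 specific IDs, and the 16-bit field leaves 32,768 of each.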

~ Todd

jh7370 · October 10, 2024, 8:31am

Hi @snidertm,

Thanks for the proposal. It sounds interesting. Do you expect this to be supported more broadly outside of the LLVM toolchain (and your own one)? If so, this should really be proposed to the ELF gABI list. If, on the other hand, it is intended to be kept local to LLVM, the names and values of section header types etc should be vendor-specific (i.e. the value should be in the vendor-specific range, the type name should start with SHT_LLVM_, and the section name should probably start with .llvm_).

I haven’t looked at the format details in particular, but I do wonder whether many of the SMK values you mention should really be section flags? The linker tends to work at a section level, with symbols being “carried along” with the sections they are in. This is why, for example, the RETAIN functionality is a section flag, not anything else. Anything to do with location in memory really is a section property - the compiler should generate the relevant symbol in its own section with the appropriate section type and flags. Otherwise, if you end up with two symbols in the same section with conflicting properties, the linker won’t be able to process them correctly.

I think there likely is a case for the metadata section at some point, but there needs to be a current motivation that cannot be solved by other existing mechanisms before it gets adopted. Looking at the existing SMK values you listed, I don’t really see that motivation*:

SMK_NOINIT - this is comprised of two parts: the “specific address” part and the “not to be initialized” part. The “specific address” part really should be a section flag, possibly combined with leveraging the unused-in-an-ET_REL object sh_addr field to specify the specific address. The “not to be initialized” part sounds like just a kind of .data (in contrast with .bss, which is zero-initialized).
SMK_PERSISTENT - Same comment as above re. specific address. “Initialized at load time” - initialized to what? A dynamically relocated value? In which case, it’s just another form of .data, supported by relocations.
SMK_LOCATION - See above.
SMK_RETAIN - We won’t want to support this upstream, since it is obsolete.
SMK_OPTPRINTF_FORMATSTR - This one I don’t have any particular notes on except that it sounds like it’s quite niche? What do you mean by “facilitate specialization”?
SMK_OF_PLACEMENT - This is probably better off being a section type/flag, for the same reasoning as the “specific address” stuff. I’m working on the assumption that there’s a linker script that knows about memory types (or similar), because the core linker wouldn’t know about this.
SMK_PRE_POST_PAD - To clarify, is this talking about something similar to alignment padding? Regardless, this is a section property as described in the detail, so should become a section flag of some kind, probably.

* I don’t really understand what you mean by “(un)initialized at load time” and “reset” in the context of these SMK values. Could you clarify what you mean by this? ELF supports two kinds of data: zero-initialized by the loader (i.e. .bss) and set to a value (fixed either by relocations at static link time or by the compiler, or set to a dynamically-relocated value at load time). My comments above may not be quite correct for the feature, given my lack of understanding here.

tannewt · October 10, 2024, 6:21pm

How would this get propagated to other sections? We often want a section to be runnable from a given section (because it may disable other memory.) That means all referenced sections also need to be marked. Would the compiler do that or the linker (because it knows memory layout)?

Also, could this symbol metadata be a way to pass performance info to the linker for tightly coupled memory placement? Maybe we’d need basic block sections instead of function sections at that point.

snidertm · October 11, 2024, 10:17pm

Hi James,

Thanks for the questions and comments.

Responses below …

~ Todd

As a piece of common infrastructure, I think it would be worthwhile to propose this as an official extension to ELF. It has been proven useful for embedded applications, but I’m not sure if the supporting argument is strong enough for general applications.

I agree that if it is only going to be implemented in LLVM and/or GNU, then the SMK kinds and the generated symbol metadata section name will get vendor-specific annotation.

It is true that many kinds of symbol metadata are essentially flags (retain, noinit, persistent) that could very well be represented as section header flags. However, there are kinds of symbol metadata that are not representable as flags.

For example, consider the SMK_OPTPRINTF_FORMATSTR kind. In our libc implementation, printf-like functions (fprintf, printf, sprintf, etc) all call a common helper function that is responsible for handling the format string. There are multiple implementations of this helper function available: a small version handles simple format specifiers, while a large version adds support for floating-point format specifiers in addition to all others. Depending on what format specifiers are used in the application, the linker can select a small, medium, or large version of the helper function at link time. This is facilitated by having the compiler record the format specifiers accompanying calls to printf-like functions and passing this information to the linker. There is another SMK that I haven’t mentioned to help with calls to printf-like functions that pass only a string pointer as the format string argument. If the linker is unable to safely decide that a smaller helper function is adequate, the large helper function is chosen.
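A sketch of the selection logic, to make the idea concrete. It is not the TI implementation: it models only the small and large tiers (the criteria for the medium tier are not described here), treats any floating-point conversion as requiring the large helper, and represents a call whose format string is unknown as a NULL entry. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { HELPER_SMALL, HELPER_LARGE } PrintfHelper;

/* Return true if fmt contains a floating-point conversion (a, e, f, g). */
static bool uses_float_conversion(const char *fmt) {
    for (const char *p = fmt; *p; p++) {
        if (*p != '%')
            continue;
        p++;
        if (*p == '%')
            continue;                       /* "%%" is a literal percent */
        /* Skip flags, width, precision and length modifiers. */
        p += strspn(p, "-+ #0123456789.*hlLjzt");
        if (*p == '\0')
            break;                          /* truncated specifier */
        if (strchr("aAeEfFgG", *p))
            return true;
    }
    return false;
}

/* Pick a helper for all recorded format strings; a NULL entry stands for
   a call whose format string could not be determined at compile time. */
static PrintfHelper select_helper(const char *const *fmts, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (fmts[i] == NULL || uses_float_conversion(fmts[i]))
            return HELPER_LARGE;            /* fall back to the safe choice */
    return HELPER_SMALL;
}
```

The important property is the conservative default: any call the linker cannot analyze forces the large helper, so the specialization cannot break a program.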

To reiterate, the symbol metadata kind in this use case cannot be represented by a flag. It requires an accompanying value.

The chief motivation for this proposal is to support symbol metadata kinds that require an accompanying value that is to be interpreted based on the kind, something that section header flags do not support.

As I point out in the proposal, another disadvantage of section header flags is that they are limited in the number of flags available because they are implemented as a bitmask.

With regards to the situation where two symbols with conflicting properties end up being defined in the same section, I believe that such conflicts can be easily detected and diagnosed at link time.
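As an illustration of such a check, the sketch below looks for two location requests on symbols in the same section that cannot both be satisfied. It assumes a location request pins the symbol itself, so the implied section base is the requested address minus the symbol's offset within the section. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal view of a symbol in a relocatable file. */
typedef struct { uint32_t section; uint32_t offset; } Sym;

/* A decoded location request: which symbol, and the requested address. */
typedef struct { uint32_t sym_index; uint32_t address; } LocationReq;

/* Find two requests for symbols in the same section that imply different
   section base addresses. On conflict, stores the request indices in
   *a and *b and returns true. */
static bool find_location_conflict(const Sym *syms, const LocationReq *reqs,
                                   size_t n, size_t *a, size_t *b) {
    for (size_t i = 0; i < n; i++) {
        const Sym *si = &syms[reqs[i].sym_index];
        uint32_t base_i = reqs[i].address - si->offset;
        for (size_t j = i + 1; j < n; j++) {
            const Sym *sj = &syms[reqs[j].sym_index];
            if (si->section != sj->section)
                continue;                   /* different sections never clash */
            if (reqs[j].address - sj->offset != base_i) {
                *a = i;
                *b = j;
                return true;
            }
        }
    }
    return false;
}
```

The pairwise scan is quadratic in the number of requests, which should be acceptable as long as located symbols are rare; a real linker would more likely group the requests by section first.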

One additional note … it is interesting to consider that a retain attribute is applied to a symbol in the source code (e.g. C, C++), and one could argue that the retain characteristic belongs to that symbol alone, but the current implementation turns the retain attribute into a section header flag that also applies to any other symbol that is defined in that section. If you don’t generate functions and symbols into their own sections or if you have not enabled garbage collection, then every function and variable defined in that section is going to be retained. This can have significant code size implications. My point being that propagating symbol properties to the sections that they are defined in could have unintended costs.

I would argue that the SMK_OPTPRINTF_FORMATSTR kind and the feature that it supports (discussed above) fits your stated criteria to justify a metadata section.

The “noinit” attribute is combined with a “location” attribute in the source code, so both the SMK_NOINIT and SMK_LOCATION kinds apply to a symbol that has these attributes. The “location” value is attached to the SMK_LOCATION symbol metadata record.

I don’t necessarily agree that “specific location” should be a section flag. There are other ways of indicating the placement of a section to a specific address. For example, a linker script can specify the placement of an output section to a specific address. In our downstream linker, placement instructions for an output section are represented in a linker-internal data structure. So, the “location” attribute becomes an SMK_LOCATION symbol metadata record in the object file, and the linker records this information as a placement instruction that is associated with the section that the symbol is defined in.

I went into some detail about the printf specialization feature above.

Again, the preference for a particular memory type is being specified as an attribute in the source code. The identification of the type could be a separate section flag of its own, but then we’re using up more section flags when the symbol metadata record has the flexibility to handle as many memory types as you can invent.

Another potential attribute might be to indicate the name of a memory region you want a function or variable to be allocated to. That use case seems better suited to a symbol metadata record since you’d want to allow an application developer to be able to name their own memory regions in a linker script.

And yes, the linker would correlate the memory type or memory region name to a definition of the type or region in the linker script that is being used to build a given application.

Sort of. It is more to do with creating a padding buffer around a function that you’d like to make more secure. You might fill the padding with instructions that are guaranteed to trigger a hardware exception if executed.

The fact that you want padding before and after a function could be a section header flag, but you also may want to tell the linker the minimum size of the padding (assuming alignment would potentially increase the pre-padding). Representing the size value feels like a natural fit for a symbol metadata record.

In our downstream linker, we support different auto-initialization models:

load-time - RW data in .bss is zero-initialized; RW data encoded in .data section is loaded directly to its target memory location
run-time - RW data in .bss is zero-initialized; linker generated initialization records (.cinit) that are processed as part of the boot routine to initialize the .data section

In the run-time auto-initialization model, when the device running the application is “reset”, then the boot routine gets called from the reset interrupt vector.

A variable marked with the “noinit” attribute does not get initialized at all (it will likely be a variable that is allocated to a special address - peripheral, memory mapped register, etc). A variable that is marked with the “persistent” attribute will only be initialized when the application is initially loaded. It does not get a .cinit record, so it is not re-initialized when the device is “reset”. Hope this helps to clarify.

MaskRay · October 12, 2024, 12:22am

Minor:

The proposal seems to mix sm_info and smi_info, which seems like typos.

typedef struct {
Elf64_Xword sm_info;
Elf64_Xword sm_value;
} Elf64_SymMeta;

This part seems wrong as well.

Index of symbol associated with metadata: index = ((Elf64_Xword)smi_info >> 48);
Symbol metadata kind: kind = (smi_info & 0xffff);

The proposal does not address the needs of generic symbol metadata.
I can add more reasoning beyond the previous discussion in 2020
https://groups.google.com/g/generic-abi/c/QPgYf3-_Iyw (where the same structure has been proposed).

As jyknight noticed, a single value member might work for a specific use case like TI’s location attribute, but it seems insufficient for the broader vision of a generic metadata section.

Furthermore, a 16-bit symbol kind is insufficient, and the need for a registry to assign kinds hinders the convenience of using “generic” metadata.

The existing .llvm_addrsig section suffers from limitations.
Tools like ld -r and objcopy can inadvertently invalidate it due to symbol table modifications.
Ideally, relocations could enable certain operations, but size constraints in REL/RELA prevent this approach.
Additionally, the proposed {sm_info, sm_value} structure resembles Elf64_Rel and isn’t size-efficient either.

However, there are promising alternatives.
For instance, the representation for SHT_LLVM_CALL_GRAPH_PROFILE has been changed to use one xword associated with two relocations (R_*_NONE with sym0/sym1 information).
When SHT_CREL is enabled, the output is size-efficient.

If a piece of symbol metadata can be handled by a dumb linker, section content associated with SHT_CREL relocations is a very size-efficient encoding.
In some cases, the addend member of SHT_CREL can be utilized.

However, when more complex metadata is needed that cannot be represented effectively with relocations, a dedicated section might be more suitable.

snidertm · October 12, 2024, 4:25pm

snidertm:

Summary

I discuss the limitations in the current ELF object file format for representing metadata information that applies to a given symbol. Encoding symbol metadata into a .symtab_meta section is presented as an alternate method/mechanism that is more powerful and flexible for the task at hand. The .symtab_meta section is currently used in the implementation of the location, noinit and persistent attributes as well as several other attributes in various target compiler toolchains supported by the Texas Instruments (TI) compiler tools group. The existing TI implementation is ELF-specific but could easily be adapted to work within other object formats as well.

I then summarize a proposal to upstream support for the .symtab_meta extension to the LLVM source base.

Motivation

In an ELF object file, the information associated with a given symbol is represented in a standard ELF symbol table entry. An Elf32_Sym record currently stores information about a symbol’s

name (st_name) - an index into the string table

value (st_value) - its actual integer value (if symbol is absolute) or the address where the symbol is defined (if symbol is weak or global)

size (st_size) - size in bytes based on the type of the symbol

type and binding (st_info) - binding in most significant 4-bits and type in least significant 4-bits, where type refers to a data object, function, section, or source file.

8 extra bits (st_other) - ELF specification indicates that this field is normally filled with 0 and has no defined meaning

section (st_shndx) - index into the section header table pointing to the section where the symbol is defined

Compiler-generated DWARF debug information provides additional information about a given symbol. For example, the DWARF information entry for the symbol will be annotated with both a DW_AT_location and a DW_AT_type attribute that describes the runtime location of a symbol’s definition and a symbol’s data type, respectively.

There is additional information that can be associated with a symbol to assist the linker with the proper handling of a symbol. For example, the clang compiler supports the retain attribute that instructs the linker to keep the definition of a symbol in a link even if it isn’t referenced elsewhere in the application. In the compiler-generated object file, this attribute is propagated to the section in which the symbol is defined. It is encoded as the SHF_GNU_RETAIN section flag in the sh_flags field of an ELF section header record (Elf32_Shdr or Elf64_Shdr).

The sh_flags field of an ELF section header is to be interpreted as a bitmask, which means that any general and/or processor-specific semantic information relevant to a given section is limited to the number of bits in the sh_flags field.

An additional limitation to representing semantic information about sections or symbols in a bitmask is that a symbol attribute that has an associated value cannot be represented with only a single bit in the sh_flags field.

An Alternative Representation of Symbol Metadata Information

There is a more expressive and flexible way of representing symbol metadata information in an ELF object file.

Consider an embedded application that defines a data object that must reside at a specific location at run time. The TI compiler toolchains support a location attribute that can be specified as follows:

__attribute__((location( 0x12345678 ))) int my_located_var = 10 ;

The TI compiler will generate a symbol metadata record into a special section named .symtab_meta that contains:

a symbol table index pointing to the symbol that this meta-data applies to

the kind of symbol meta-data; an integer representation instead of a bitmask

the value associated with the meta-data; for a location kind of meta-data, the value would be an address

There are ways to get around some of the above-mentioned limitations. For example, the Arm Ltd. compiler supports an at attribute that is identical to the above location attribute in intent:

__attribute__((at( 0x12345678 ))) int my_located_var = 10 ;

Instead of encoding the value of the at attribute in a flag or an extension to the symbol table, the Arm Ltd compiler puts the definition of the variable into a section whose name is annotated with the specified address argument.

However, there are additional benefits to the proposed .symtab_meta approach to representing extra semantic information about symbols. Not only does representing the metadata kind as an integer vastly increase the number of different kinds of symbol metadata information that can be represented in an object file, but also:

It is not limited in the number of different kinds of symbol metadata information that can be applied to the same symbol

It is not limited in the kinds of values that can be associated with a given piece of symbol metadata information; a value field could be:

an integer value (as in the above example)

a string table index (e.g. a string encoding of format specifiers associated with printf-like function calls)

a symbol table index (e.g. indicating a symbol-to-symbol alias mapping)

Encoding symbol metadata as an extension to the symbol table enables many capabilities that are particularly useful for embedded applications, such as:

Communication of placement-related information from compiler to linker

profile-based placement or explicit user-directed placement - expressing the preferred memory type in which to place a symbol (e.g. TCM, on-chip, off-chip, etc)

specific placement - Arm Ltd’s at attribute / TI’s location attribute

Communication of special initialization semantics

TI’s noinit and persistent attributes

Link-time function specialization

Boot routine

Run-time initialization

memset, memcpy specialization

printf specialization

The majority of the above capabilities have already been implemented and are used today in the TI compiler toolchains.

Proposal

I propose to upstream support for this symbol metadata extension to the symbol table to the LLVM source base. This entails:

Providing a mechanism to opt-in/opt-out of including this support in a given toolchain

Encoding symbol attributes and semantic information that is not otherwise already represented into compiler-generated object files, specifically in a .symtab_meta section consisting of an array of fixed-length symbol metadata information records

Adding support for symbol metadata assembly directives that encode symbol metadata information into an object file that is generated from the assembler

Adding support for generating symbol metadata assembly directives when compiling to assembly

Adding support to edit the .symtab_meta section in conjunction with edits to the symbol table in llvm-objcopy

Reading, processing, and applying the contents of a .symtab_meta section in the lld linker

A specification of the proposed .symtab_meta section and fixed-length symbol metadata records follows.

.symtab_meta Section and Symbol Metadata Records

.symtab_meta Section

The section table entry for the .symtab_meta section will contain values that are particularly relevant to the .symtab_meta section:

sh_name = index into string table pointing to “.symtab_meta” string

sh_type = SHT_SYMTAB_META (a new section header type)

sh_addr = 0 (.symtab_meta section is not loaded into target memory)

sh_link = index of .symtab section in the section header table
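As a rough sketch of the section header values listed above, a producer might fill in an Elf32_Shdr along these lines. The numeric value of SHT_SYMTAB_META, the sh_entsize and sh_addralign settings, and the helper name make_symtab_meta_shdr are assumptions made for illustration; the proposal itself only specifies sh_name, sh_type, sh_addr and sh_link.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Elf32_Word;
typedef uint32_t Elf32_Addr;
typedef uint32_t Elf32_Off;

/* Standard ELF32 section header layout. */
typedef struct {
    Elf32_Word sh_name;
    Elf32_Word sh_type;
    Elf32_Word sh_flags;
    Elf32_Addr sh_addr;
    Elf32_Off  sh_offset;
    Elf32_Word sh_size;
    Elf32_Word sh_link;
    Elf32_Word sh_info;
    Elf32_Word sh_addralign;
    Elf32_Word sh_entsize;
} Elf32_Shdr;

/* Placeholder only: the proposal does not assign a number to this type. */
#define SHT_SYMTAB_META 0x60000001u

/* One Elf32_SymMeta record is two Elf32_Word fields. */
#define SYMMETA32_ENTSIZE 8u

/* Hypothetical helper: build the header for a .symtab_meta section that
   holds nrecords fixed-length records. */
Elf32_Shdr make_symtab_meta_shdr(Elf32_Word name_off, Elf32_Word symtab_shndx,
                                 Elf32_Off file_off, Elf32_Word nrecords) {
    assert(nrecords <= UINT32_MAX / SYMMETA32_ENTSIZE);

    Elf32_Shdr sh = {0};
    sh.sh_name = name_off;          /* offset of ".symtab_meta" in the string table */
    sh.sh_type = SHT_SYMTAB_META;   /* new section header type */
    sh.sh_addr = 0;                 /* not loaded into target memory */
    sh.sh_offset = file_off;
    sh.sh_size = nrecords * SYMMETA32_ENTSIZE;
    sh.sh_link = symtab_shndx;      /* index of .symtab in the section header table */
    sh.sh_addralign = 4;            /* assumption: word alignment */
    sh.sh_entsize = SYMMETA32_ENTSIZE; /* assumption: fixed-size entries */
    return sh;
}
```

Leaving sh_flags at 0 (no SHF_ALLOC) matches the statement that the section is not loaded into target memory.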

Symbol Metadata Records

Elf32_SymMeta

typedef struct {

Elf32_Word sm_info;

Elf32_Word sm_value;

} Elf32_SymMeta;

where:

Index of symbol associated with metadata: index = ((Elf32_Word)sm_info >> 8);

Symbol metadata kind: kind = (sm_info & 0xff);

A sub-range of kind identifiers will be reserved for processor-/toolchain-specific use

Interpretation of sm_value field depends on kind
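To make the packing of the sm_info field concrete, here is a minimal sketch of helpers that build and decode an Elf32_SymMeta record. The helper names and the kind number SMK_LOCATION_EXAMPLE are hypothetical; the proposal does not assign kind values in this post.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Elf32_Word;

typedef struct {
    Elf32_Word sm_info;
    Elf32_Word sm_value;
} Elf32_SymMeta;

/* Hypothetical kind number, for illustration only. */
#define SMK_LOCATION_EXAMPLE 1u

/* Upper 24 bits of sm_info: symbol table index. */
Elf32_Word symmeta32_index(const Elf32_SymMeta *m) {
    return m->sm_info >> 8;
}

/* Lower 8 bits of sm_info: metadata kind. */
Elf32_Word symmeta32_kind(const Elf32_SymMeta *m) {
    return m->sm_info & 0xffu;
}

/* Pack index and kind into sm_info; both must fit their bit ranges. */
Elf32_SymMeta symmeta32_make(Elf32_Word index, Elf32_Word kind,
                             Elf32_Word value) {
    assert(index < (1u << 24));
    assert(kind <= 0xffu);
    Elf32_SymMeta m = { (index << 8) | kind, value };
    return m;
}
```

For the earlier my_located_var example, a record for symbol index 42 would be built as symmeta32_make(42, SMK_LOCATION_EXAMPLE, 0x12345678), with the address carried in sm_value.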

Elf64_SymMeta

typedef struct {

Elf64_Xword sm_info;

Elf64_Xword sm_value;

} Elf64_SymMeta;

where:

Index of symbol associated with metadata: index = ((Elf64_Xword)sm_info >> 48);

Symbol metadata kind: kind = (sm_info & 0xffff);

A sub-range of kind identifiers will be reserved for processor-/toolchain-specific use

Interpretation of sm_value field depends on kind
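A corresponding sketch for the 64-bit record follows. Taking the two expressions above literally, the index occupies the top 16 bits of sm_info and the kind the bottom 16 bits, which leaves bits 16 to 47 unassigned; the sketch keeps those bits zero, and that is an assumption rather than something the specification states. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Elf64_Xword;

typedef struct {
    Elf64_Xword sm_info;
    Elf64_Xword sm_value;
} Elf64_SymMeta;

/* Symbol table index, as given by (sm_info >> 48). */
Elf64_Xword symmeta64_index(const Elf64_SymMeta *m) {
    return m->sm_info >> 48;
}

/* Metadata kind, as given by (sm_info & 0xffff). */
Elf64_Xword symmeta64_kind(const Elf64_SymMeta *m) {
    return m->sm_info & 0xffffu;
}

/* Pack index and kind; bits 16..47 are left zero (assumption). */
Elf64_SymMeta symmeta64_make(Elf64_Xword index, Elf64_Xword kind,
                             Elf64_Xword value) {
    assert(index <= 0xffffu);
    assert(kind <= 0xffffu);
    Elf64_SymMeta m = { (index << 48) | kind, value };
    return m;
}
```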

History and Rationale

I submitted an RFC for an earlier and less sophisticated method of encoding symbol metadata directly into a symbol table entry in April of 2019 (RFC - a proposal to support additional symbol metadata in ELF object files in the ARM compiler - Project Infrastructure / LLVM Dev List Archives - LLVM Discussion Forums). Feedback from that RFC thread was incorporated into what has evolved into this RFC.

An earlier version of this .symtab_meta specification was proposed for inclusion in the upstream GCC source base, but did not get adequate support for approval at the time. The specification has since been refined and the usage of .symtab_meta in TI compiler tools has expanded into areas such as support for user-directed placement attributes in source code.

Extending the symbol table with the .symtab_meta section is a piece of infrastructure that has significant value, especially for embedded toolchains and applications.

The SMK_OF_PLACEMENT kind does not propagate a memory type preference to the descendants of the function symbol that it is directly applied to. We had considered it, but the particular device that motivated the attribute did not support a use case where some of the memory could be disabled.

I would anticipate that another symbol metadata kind could be invented to fit the use case that you describe. The linker would be the more likely candidate for carrying out the propagation of the memory type preference to descendant symbols and sections since it has a call graph available.

If an LTO recompile is being done as part of the build, then the compiler could add a pass that propagates a memory type preference to the descendants of a function that has a memory type preference attribute.

For passing performance information, I imagine the compiler could generate symbols to mark the basic blocks and generate associated performance information metadata for each basic block in a function.

If you’re talking about allocating some basic blocks into tightly coupled memory and allocating other basic blocks to other types of memory, then this sounds like hot/cold-splitting that can be done in the compiler.

I don’t know the details of how hot/cold-splitting works, but conceptually I would suspect the compiler generates separate sections for hot and cold parts of a function. The hot and cold sections probably have symbols defined at their front. The compiler could attach a memory type preference to those symbols.

jh7370 · October 14, 2024, 7:39am

Thanks for the responses. I think broadly-speaking, my concerns revolve around the additional complexity required in the linker to support each of the different SMK values that could exist (in upstream tools). @MaskRay can probably comment on this more, as I haven’t really been involved with LLD’s development in years, but I think we’ve generally tried to avoid anything that requires the linker to be smart and do anything especially complex, especially when there are alternatives that can be dealt with elsewhere in the toolchain. The gc-sections behaviour is a good example: it would be possible to discard symbols without requiring them to be in their own sections, but this is complicated and costly, as the linker has to split the section up and read the relocations to identify which individual symbols can be discarded, then somehow reconstruct the remaining bits of the sections - we had this behaviour in the Sony proprietary linker, and it wasn’t nice, due to there being so many edge cases apart from anything else.

Traditionally, linkers have relied on relocations and, more frequently, linker scripts to get most of this complicated behaviour. For example, padding can be added (and the value of the padding specified) for sections in linker scripts. I acknowledge that this doesn’t allow for specifying the padding in your source code, although I could imagine that the padding could perhaps be a constant in an input file that could be read by the linker script, as an example.

Another thought I had was that symbol metadata is limited to symbols. Are there more general behaviours that need to be communicated from source through to the linker that can’t be captured in the current form? Once upon a time, a proposal for a linker command section was discussed, to enable this sort of communication, but it was never properly supported in LLVM.

snidertm · October 14, 2024, 4:43pm

Understood. I think the thread of this RFC has gone a bit too far into the details of how particular features are implemented in the linker and away from the primary intent of the proposal, which is to offer a mechanism for communicating linker-relevant information (placement preferences, symbol/section relevant semantics, etc) from the compiler to the linker.

The above discussion of the SMKs was intended to illustrate examples of features that could be supported by this mechanism. The SMKs described are things that have been implemented in the downstream TI linker. They are useful in embedded applications but are probably not features that would be appropriate to upstream. For example, the SMK_OPTPRINTF_FORMATSTR in particular is a niche feature.

I suspect there are not many SMKs that are truly generic. An exception might be the SMK_OF_PLACEMENT kind. It enables the source code to tell the linker to allocate a particular function to a particular type of memory. I imagine that this could be incorporated into a hot/cold-splitting optimization, where an attribute can be used to place a function’s hot section in fast memory and its cold section in slower memory. It could also be applicable to profile-guided placement.

I agree that this should be the general rule-of-thumb for the upstream linker sources. Downstream linkers, especially those in support of embedded applications, do need differentiating features that may require some complexity in their implementation.

The only things that are coming to mind would require the compiler to have awareness of a device’s memory layout. Maybe some of this could be inferred from the target triple if the target triple is sufficient to identify a specific device?

snidertm · October 14, 2024, 5:57pm

Thanks for catching the typos. I have made the relevant corrections in the original post.

The linked discussion gets overly concerned with the implementation of linker features that are being facilitated by specific symbol metadata kinds. The proposal referenced in the link, like this proposal, originated from TI.

The intent of this proposal, as mentioned in my reply to @jh7370, is to offer a mechanism for communicating linker-relevant information from the compiler to the linker.

I’m not sure how a 16-bit symbol [metadata] kind is insufficient. The proposal states that the kind field is to be interpreted as an integer ID.

Yes, the implementation calls for a registry to assign kind IDs. I agree that this hinders the convenience of using generic metadata. One could argue that there is precedent for using a registry in this way as exemplified by many of the flag-type fields in the ELF specification.

Perhaps a way around this is to use a string representation of the symbol metadata kind. This could be represented in a symbol metadata record as a string table index.

A concern that was raised earlier on this thread by @jyknight was that using an index into the symbol table as a field in the symbol metadata record means that symbol metadata records will have to be kept in sync for any application (like llvm-objcopy) that can edit the symbol table. I did not find this to be particularly difficult to deal with, but it is an extra maintenance burden.
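To illustrate what keeping the records in sync involves, here is a minimal sketch of the fix-up a tool like llvm-objcopy would need after it renumbers or removes symbols. It assumes the 32-bit record layout from the original post; the old_to_new map, the SYM_REMOVED marker, the function name, and the choice to drop records whose symbol was removed are all hypothetical and not taken from the TI implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Elf32_Word;

typedef struct {
    Elf32_Word sm_info;
    Elf32_Word sm_value;
} Elf32_SymMeta;

/* Hypothetical marker meaning "this symbol no longer exists". */
#define SYM_REMOVED UINT32_MAX

/* old_to_new[i] is the new index of old symbol i, or SYM_REMOVED.
   Rewrites recs in place, drops records whose symbol was removed,
   and returns the number of records kept. */
size_t symmeta32_remap(Elf32_SymMeta *recs, size_t nrecs,
                       const Elf32_Word *old_to_new, size_t nsyms) {
    size_t out = 0;
    for (size_t i = 0; i < nrecs; i++) {
        Elf32_Word old_index = recs[i].sm_info >> 8;
        Elf32_Word kind = recs[i].sm_info & 0xffu;
        assert(old_index < nsyms);

        Elf32_Word new_index = old_to_new[old_index];
        if (new_index == SYM_REMOVED)
            continue;               /* symbol is gone, so drop its metadata */
        assert(new_index < (1u << 24));

        recs[out].sm_info = (new_index << 8) | kind;
        recs[out].sm_value = recs[i].sm_value;
        out++;
    }
    return out;
}
```

This sketch leaves sm_value untouched. For a kind whose value is itself a symbol table index, such as a symbol-to-symbol alias mapping, the value would need the same remapping, which means the tool has to understand that kind.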

Yes, symbol metadata records are very much like relocations, but they don’t patch values in an object file’s encoding. Instead, they carry instructions from the compiler to the linker.

I suppose if we thought of relocations and symbol metadata records as two derivations of the same base class, then there would be quite a bit of overlap in how they are generated by object producers and how they are processed by object consumers.

Re: size efficiency … I had implemented the symbol metadata records using ULEB128 values for kind, symbol table index, and value.
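For readers unfamiliar with ULEB128, a variable-length record along those lines could be encoded roughly as follows. ULEB128 stores 7 bits per byte, least significant group first, with the high bit set on every byte except the last. The field order (kind, symbol table index, value) and the function names are assumptions for illustration; this is not the TI encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value as ULEB128 into out[0..cap). Returns the number of
   bytes written (at most 10 for a 64-bit value). */
size_t uleb128_encode(uint64_t value, uint8_t *out, size_t cap) {
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7f);
        value >>= 7;
        if (value != 0)
            byte |= 0x80;           /* more bytes follow */
        assert(n < cap);
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Hypothetical record layout: kind, symbol index, value, each ULEB128. */
size_t symmeta_encode_uleb(uint64_t kind, uint64_t index, uint64_t value,
                           uint8_t *out, size_t cap) {
    size_t n = 0;
    n += uleb128_encode(kind, out + n, cap - n);
    n += uleb128_encode(index, out + n, cap - n);
    n += uleb128_encode(value, out + n, cap - n);
    return n;
}
```

Small kinds and symbol indexes take one byte each in this form, so the saving over a fixed-length record depends mostly on how large the values are.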

Thanks for the feedback!

jh7370 · October 15, 2024, 7:30am

A part of me is wondering whether a SHT_NOTE section with a specific name could be sufficient for the desired communication mechanism, given that there doesn’t seem to be much, if anything, that an upstream linker would do with this section. However, it doesn’t solve the issue of llvm-objcopy having to know about the symbol indexes or risking messing it all up.

snidertm · October 15, 2024, 11:05pm

This is interesting. I like the fact that SHT_NOTE is already a defined ELF section header type, so there would be no need to introduce a new type.

However, the ELF spec Note Section includes this verbiage:

Note information is optional. The presence of note information does not affect a program’s ABI conformance, provided the information does not affect the program’s execution behavior. Otherwise, the program does not conform to the ABI and has undefined behavior

Since a given symbol metadata record may affect run-time behavior (for example, whether a symbol is re-initialized on device reset), does that disqualify a table of symbol metadata records from being encoded as an SHT_NOTE section?

MaskRay · October 16, 2024, 6:14am

snidertm:

Thanks for catching the typos. I have made the relevant corrections in the original post.

MaskRay:

The proposal does not address the needs of generic symbol metadata.
I can add more reasoning beyond the previous discussion in 2020
https://groups.google.com/g/generic-abi/c/QPgYf3-_Iyw (where the same structure has been proposed).

The linked discussion gets overly concerned with the implementation of linker features that are being facilitated by specific symbol metadata kinds. The proposal referenced in the link, like this proposal, originated from TI.

The intent of this proposal, as mentioned in my reply to @jh7370, is to offer a mechanism for communicating linker-relevant information from the compiler to the linker.

MaskRay:

Furthermore, a 16-bit symbol kind is insufficient, and the need for a registry to assign kinds hinders the convenience of using “generic” metadata

Ideally, relocations could enable certain operations, but size constraints in REL/RELA prevent this approach.
Additionally, the proposed {sm_info, sm_value} structure resembles Elf64_Rel and isn’t size-efficient either.

I’m not sure how a 16-bit symbol [metadata] kind is insufficient. The proposal states that the kind field is to be interpreted as an integer ID.

Yes, the implementation calls for a registry to assign kind IDs. I agree that this hinders the convenience of using generic metadata. One could argue that there is precedent for using a registry in this way as exemplified by many of the flag-type fields in the ELF specification.

Perhaps a way around this is to use a string representation of the symbol metadata kind. This could be represented in a symbol metadata record as a string table index.

A concern that was raised earlier on this thread by @jyknight was that using an index into the symbol table as a field in the symbol metadata record means that symbol metadata records will have to be kept in sync for any application (like llvm-objcopy) that can edit the symbol table. I did not find this to be particularly difficult to deal with, but it is an extra maintenance burden.

While it’s understandable to seek a general solution, I urge us to consider the practical implications of a generic scheme.

(a) A registry for kind IDs is essential. This makes it no more appealing than adding a new section type.
Some kind IDs require mandatory linker processing, while others are optional, necessitating a numbering convention.
Using strings adds overhead and contradicts the ELF philosophy of using section types rather than names.

(b) As discussed, many symbol table metadata section scenarios require more than a single xword. These scenarios will have their own conventions or rely on relocations, making a generic scheme less beneficial.
In fact, I suspect that many cases will fall into this category, leading me to question the practicality of a generic proposal.

(c) Some prominent GNU contributors, who were in the generic ABI discussions, have yet to express enthusiasm for this approach.
Given my long-standing subscription to the binutils project, I believe the likelihood of them adopting .symtab_meta is quite low.
Even when introducing SHT_LLVM_* sections, we must carefully consider interoperability with the GNU toolchain.

snidertm:

MaskRay:

However, there are promising alternatives.
For instance, the representation for SHT_LLVM_CALL_GRAPH_PROFILE has been changed to use one xword associated with two relocations (R_*_NONE with sym0/sym1 information).
When SHT_CREL is enabled, the output is size-efficient.

If a piece of symbol metadata information can be handled by a dumb linker, section content associated with SHT_CREL relocations is a very size-efficient encoding.
In some cases, the addend member of SHT_CREL can be utilized.

Yes, symbol metadata records are very much like relocations, but they don’t patch values in an object file’s encoding. Instead, they carry instructions from the compiler to the linker.

I suppose if we thought of relocations and symbol metadata records as two derivations of the same base class, then there would be quite a bit of overlap in how they are generated by object producers and how they are processed by object consumers.

Re: size efficiency … I had implemented the symbol metadata records using ULEB128 values for kind, symbol table index, and value.

Thanks for the feedback!


GNU objcopy and similar binary manipulation tools fix relocation symbol indexes when changing the symbol table.
This makes a relocation-based scheme appealing.

Every architecture defines an R_*_NONE relocation, a marker type. We could leverage this existing mechanism, similar to how SHT_LLVM_CALL_GRAPH_PROFILE utilizes it.

You might want to take a look at CREL, which could be used as a symtab metadata replacement. The format utilizes ULEB128 and delta coding well.
It looks likely to be accepted by the GNU folks, though someone needs to create patches.

While using a standard section code remains a challenge
(Evolution of the ELF object file format | MaskRay),
achieving consensus between GNU and LLVM may be sufficient, even without formal approval from the generic ABI (nobody has authority anyway).

jh7370 · October 16, 2024, 7:41am

I actually find this quote hard to understand, so I’m not sure. In particular, I’m uncertain whether it is intended to refer to the final linked image, or it includes input objects. Assuming it only refers to the linked image, that gives the static linker space to interpret that section and discard it.

ShankarEaswaran · October 16, 2024, 5:42pm

The metadata section appears to be sparse. Do you need this to be a separate section? You could create an auxiliary section to convey the mapping from symbols to additional metadata information (like .llvm_addrsig etc).

For the case where you are looking to place the symbols at a specified address, you would need to be looking to place the section at the specified address. There could be more than one symbol in the section.

How does this work for bitcode?

Symbol resolution rules based on the metadata also may need to be modeled.

quic-akaryaki · October 16, 2024, 6:36pm

I may have missed it if this has been mentioned, but I wonder if the existing format of the attributes section can be used, as in .riscv.attributes and .ARM.attributes. I believe there is a single encoding, and attributes can be associated with files, sections, and symbols. Only file-level attributes are supported by LLVM, but adding support for symbol attributes could be beneficial.

snidertm · October 16, 2024, 7:59pm

MaskRay:

GNU objcopy and similar binary manipulation tools fix relocation symbol indexes when changing the symbol table.
This makes a relocation-based scheme appealing.

Every architecture defines an R_*_NONE relocation, a marker type. We could leverage this existing mechanism, similar to how SHT_LLVM_CALL_GRAPH_PROFILE utilizes it.

You might want to take a look at CREL, which could be used as a symtab metadata replacement. The format utilizes ULEB128 and delta coding well.
It looks likely to be accepted by the GNU folks, though someone needs to create patches.

While using a standard section code remains a challenge
(Evolution of the ELF object file format | MaskRay),
achieving consensus between GNU and LLVM may be sufficient, even without formal approval from the generic ABI (nobody has authority anyway).

Thanks. You make a compelling argument for using relocation entries to carry symbol metadata information in the object file. The kind, symbol index, and value fields of the originally proposed symbol metadata record map easily to the type, symbol table index, and addend fields of an Elfxx_Rela record.
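The mapping described here can be sketched with the standard Elf32_Rela layout and the ELF32_R_* macros from the ELF specification. Using the metadata kind directly as the relocation type, setting r_offset to 0, and the function name symmeta_as_rela are assumptions for illustration; a real scheme would need agreed relocation type numbers or R_*_NONE markers, as discussed above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Elf32_Addr;
typedef uint32_t Elf32_Word;
typedef int32_t  Elf32_Sword;

/* Standard ELF32 relocation entry with addend. */
typedef struct {
    Elf32_Addr  r_offset;
    Elf32_Word  r_info;
    Elf32_Sword r_addend;
} Elf32_Rela;

/* Standard ELF32 r_info accessors. */
#define ELF32_R_SYM(i)     ((i) >> 8)
#define ELF32_R_TYPE(i)    ((unsigned char)(i))
#define ELF32_R_INFO(s, t) (((s) << 8) + (unsigned char)(t))

/* Hypothetical mapping: kind -> type, symbol index -> symbol, value -> addend. */
Elf32_Rela symmeta_as_rela(Elf32_Word sym_index, Elf32_Word kind,
                           Elf32_Word value) {
    assert(sym_index < (1u << 24));
    assert(kind <= 0xffu);
    assert(value <= (Elf32_Word)INT32_MAX); /* keep the addend conversion well defined */

    Elf32_Rela r;
    r.r_offset = 0;                         /* assumption: nothing is patched */
    r.r_info = ELF32_R_INFO(sym_index, kind);
    r.r_addend = (Elf32_Sword)value;
    return r;
}
```

Because r_info packs the symbol index the same way as the originally proposed sm_info (index in the upper 24 bits, an 8-bit code in the lower bits), tools that already rewrite relocation symbol indexes would handle these entries without new logic.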

I have looked into CREL some and I agree that it is a file-size-efficient, viable alternative to fixed-length relocation entries.

I think my next step is to transition our TI downstream implementation for symbol metadata records to relocation entries, then determine what makes sense to upstream.
