[RFC] PC-Keyed Metadata at Runtime

melver · August 1, 2022, 9:13am

Summary

Various semantic information available at source or at LLVM IR level may be lost when lowering and generating target-specific code. As such, it becomes impossible to recover such information without additional metadata stored elsewhere that can map instruction and function addresses, viz. program counters (PCs), to metadata of interest.

Such metadata can aid in more accurate runtime binary analysis that requires knowledge of source-level information (e.g. atomic vs. plain accesses in data race detection). Similarly, source-level debug information must be stored (e.g. as DWARF) alongside the binary to recover useful debugging information. Unfortunately, debug information is not guaranteed to be present in a binary (it may be stripped), nor can it be efficiently extended to store arbitrary new metadata: both factors are crucial for metadata that is required at runtime and affects the correct and fast operation of a program.

We propose a mechanism to efficiently generate and store arbitrary PC-keyed metadata associated with IR instructions that can be retrieved at runtime. The following discusses background and motivation in more detail, followed by design of the core feature, followed by the first concrete use case.

An earlier discussion that led to this RFC may be found here.

Background and Motivation

To perform certain detailed runtime binary analysis on an otherwise unmodified binary, semantic metadata is required that is lost when generating machine code. For example, data race detection requires knowledge of atomic accesses to avoid false positives. For deployment in production, however, this metadata needs to be stored in the binary and needs to be accessible efficiently at runtime: the presence of the metadata should not affect performance of the binary unless it is accessed, and overall binary size should be minimally impacted. Therefore the metadata will require storage in separate loadable sections, with size having priority over extensibility, backwards compatibility, or human readability. Crucially, for some deployment scenarios, the presence of the metadata is required for the correct and fast operation of a program (this is unlike traditional debug information, which may be stripped).

Use cases. Most of the immediate use cases are to generate PC-keyed semantic metadata for sampling-based error detectors, a.k.a. sanitizers, that have zero overhead when disabled. The first such sanitizer will be a variant of GWP-TSan, but other GWP-Sanitizers (such as a UBSan and an MSan variant) that require language-level semantic information are planned. Other binary instrumentation tools, such as Valgrind, Helgrind, or DRD, could also benefit from PC-keyed metadata.

Challenges. The main challenge here is that instruction PCs will only be known in the backend during code generation, yet the semantic information of interest is only known in the frontend or middleend: propagating to the backend which instructions should have PC-keyed metadata emitted is non-trivial. The implementation should also take care to work well with the linker garbage collector (GC), such that if associated code is dropped, the metadata is dropped, too. Finally, the encoded PCs should be stored as efficiently as possible, avoiding relocations if possible (which add size and linker overheads).

Related Features

Similar metadata is emitted by some of the following:

- SanitizerCoverage’s PC Table feature constructs a list of basic block entry PCs with attached metadata in the __sancov_pcs section of the binary.
- The -basic-block-sections feature records metadata about each basic block in the .llvm_bb_addr_map section of the binary for use by profilers and debuggers.

The commonality here is that these only work on basic block addresses, and not individual instructions. No existing feature easily allows emitting PCs of individual instructions.

Design

The most scalable design is to allow attaching MDNodes to arbitrary IR instructions and functions, where the attached metadata is propagated through to the AsmPrinter which then interprets the metadata and generates code to emit the metadata in the binary. The metadata itself is stored in arbitrary sections determined by the information stored in the metadata.

More concretely, we introduce PC sections metadata which can be attached to IR instructions and functions, for which addresses, viz. program counters (PCs), are to be emitted in specially encoded binary sections. Metadata is assigned as an MDNode of the MD_pcsections kind (!pcsections). The format and encoding (see below) of !pcsections metadata is kept generic, so that different kinds of PC-keyed metadata can be translated to a !pcsections metadata node. Therefore, we only need to take care to propagate !pcsections metadata from IR instructions to replacement IR instructions and generated machine IR (MIR), and no special logic is required for different kinds of PC-keyed metadata.

Metadata propagation. The biggest challenge is to losslessly propagate !pcsections through IR transformations, from IR to machine IR (MIR), and through MIR transformations in the backend, through to the AsmPrinter. The problem is similar to propagating debug info. In many cases both debug info and !pcsections metadata should be copied together: for generation of MachineInstrs, we modify BuildMI() to simplify the propagation of debug info and !pcsections metadata together.

- IR-to-IR transformations: The current use cases only intend to add !pcsections metadata after all IR optimizations. As such, no special care is taken to preserve !pcsections metadata through IR transformations yet. One notable exception is the AtomicExpandPass which runs after optimizations right before instruction selection, which we update to preserve !pcsections metadata for all replacement instructions (see patch).
- IR-to-MIR lowering: MachineInstrs will allow setting !pcsections metadata via MachineInstr::setPCSections(), which stores the MDNode pointer out-of-line in MachineInstr::ExtraInfo, to avoid bloating MachineInstr in the common case (see patch). The BuildMI() MachineInstr builder is updated to take a bundle of debug info and !pcsections metadata as MIMetadata, which simplifies copying both from IR and MIR instructions (see patch).
  - SelectionDAG: Before lowering to MachineInstrs, SelectionDAG lowers instructions to SDNodes. As such, we need to introduce the ability to store !pcsections metadata in SDNodes during IR-to-SD lowering. SelectionDAG provides several callbacks that simplify propagating metadata on DAG transformations (via ReplaceAllUsesWith, see patch; and via DAGUpdateListener, see patch).
  - FastISel: Because there is no intermediate representation between LLVM IR instructions and MIR instructions, on instruction selection with FastISel the metadata is copied through MIMetadata and all BuildMI() calls are updated. Implementing FastISel support is relatively straightforward: FastISel::DbgLoc is replaced with an MIMetadata instance to copy debug info and !pcsections metadata together (see patch).
  - GlobalISel: Like FastISel, requires updating BuildMI() calls in various locations (see patch).

Metadata format. An arbitrary number of interleaved MDString and constant operands can be added, where a new MDString always denotes a section name, followed by an arbitrary number of auxiliary constants that are encoded along with the PC of the instruction or function. The first operand must be an MDString denoting the first section.

  !0 = metadata !{
    metadata !"<section#1>"
    [ , iXX <aux-consts#1> ... ]
    [ , metadata !"<section#2>"
      [ , iXX <aux-consts#2> ... ]
      ... ]
  }

The occurrence of “section#1”, “section#2”, …, “section#N” in the metadata causes the backend to emit the PC for the associated instruction or function to all named sections. For each emitted PC in a section #N, the constants aux-consts#N will be emitted after the PC.
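As a rough illustration of the operand layout described above (not the actual LLVM implementation), the grouping of operands into sections and their auxiliary constants can be modeled in standalone C++. The names Operand, SectionEntry, and groupPCSections are hypothetical, and constants are simplified to 64-bit integers:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical model of one !pcsections operand: either a section name
// (MDString) or an auxiliary constant (simplified here to 64 bits).
using Operand = std::variant<std::string, std::uint64_t>;

struct SectionEntry {
  std::string name;               // section that receives the PC
  std::vector<std::uint64_t> aux; // constants emitted after the PC
};

// Groups a flat operand list into (section, aux constants) pairs.
// Returns std::nullopt if the first operand is not a section name.
std::optional<std::vector<SectionEntry>>
groupPCSections(const std::vector<Operand> &ops) {
  if (ops.empty() || !std::holds_alternative<std::string>(ops.front()))
    return std::nullopt;

  std::vector<SectionEntry> out;
  for (const Operand &op : ops) {
    if (const auto *name = std::get_if<std::string>(&op)) {
      // A new string always starts a new section group.
      out.push_back(SectionEntry{*name, {}});
    } else {
      // Constants belong to the most recently named section.
      assert(!out.empty());
      out.back().aux.push_back(std::get<std::uint64_t>(op));
    }
  }
  return out;
}
```

For the operand list "sec_a", 1, 2, "sec_b", this sketch produces two groups: sec_a with two constants and sec_b with none. A list that starts with a constant is rejected.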

Binary encoding. Instructions result in emitting a single PC, and functions result in emission of the start of the function and a 32-bit size. This is followed by the auxiliary constants that followed the respective section name in the MD_pcsections metadata.

To avoid relocations in the final binary, each PC address stored at entry is a relative relocation, computed as pc - entry. To decode, a user has to compute entry + *entry. The size of each entry depends on the code model. With large and medium sized code models, the entry size matches pointer size. For any smaller code model the entry size is just 32 bits.
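A minimal decoding sketch for the 32-bit case, assuming each entry is a signed 32-bit offset in native byte order and, for function entries, that the 32-bit size directly follows the relative start address. The helper names (decodeRelPC32, encodeRelPC32, decodeFunctionEntry) are made up for illustration; the encoder exists only to exercise the round trip and does not reflect how the compiler emits entries:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Decode one entry: the stored value is (pc - entry), so pc = entry + *entry.
inline std::uintptr_t decodeRelPC32(const void *entry) {
  std::int32_t rel;
  std::memcpy(&rel, entry, sizeof(rel)); // safe for unaligned entries
  // Sign-extend first, then add with wrap-around unsigned arithmetic.
  return reinterpret_cast<std::uintptr_t>(entry) +
         static_cast<std::uintptr_t>(static_cast<std::intptr_t>(rel));
}

// Illustration-only inverse of the decoder: stores (pc - entry) at entry.
inline void encodeRelPC32(void *entry, std::uintptr_t pc) {
  const auto diff = static_cast<std::intptr_t>(
      pc - reinterpret_cast<std::uintptr_t>(entry));
  assert(diff >= std::numeric_limits<std::int32_t>::min() &&
         diff <= std::numeric_limits<std::int32_t>::max());
  const auto rel = static_cast<std::int32_t>(diff);
  std::memcpy(entry, &rel, sizeof(rel));
}

struct FunctionEntry {
  std::uintptr_t start; // decoded start address of the function
  std::uint32_t size;   // 32-bit function size
};

// Assumed layout: [32-bit relative start][32-bit size].
inline FunctionEntry decodeFunctionEntry(const void *entry) {
  FunctionEntry fe;
  fe.start = decodeRelPC32(entry);
  std::memcpy(&fe.size, static_cast<const unsigned char *>(entry) + 4,
              sizeof(fe.size));
  return fe;
}
```

Because the stored value is a link-time constant difference between two addresses in the same image, no dynamic relocation is needed, and the decoder works wherever the image is loaded.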

With the metadata emitted by the SanitizerBinaryMetadata pass (discussed in the next section), a study on several of the largest binaries deployed at Google showed that a naive implementation without relative relocations (and entries of regular size of 64 bits) resulted in an overall binary size increase of >10%, which was unacceptable. The proposed version with relative relocations results in an overall binary size increase of less than 2%.

Use case

The first use case will be a middleend pass, SanitizerBinaryMetadata (see patch), that will emit PC-keyed metadata for use by a set of new sanitizers. The first such sanitizer will be a variant of GWP-TSan, but other GWP-Sanitizers (such as a UBSan and an MSan variant) that require language-level semantic information are planned.

GWP-TSan will require knowledge of which instructions have been lowered from C11 and C++11 atomics, to avoid generating false positive data race reports. For now, the new pass supports generating PC-keyed metadata about atomic instructions, and which semantic features have been analyzed per function. The latter metadata enables mixing code for which no PC-keyed metadata exists with code where PC-keyed metadata has been enabled without producing false positive reports.
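To illustrate why the per-function metadata matters, here is a hedged sketch of how a runtime tool might classify an access PC once both kinds of metadata have been decoded into sorted tables. The types and the classify function are hypothetical and not part of the patches; the point is only that a PC outside any covered function yields Unknown rather than Plain, so no race is reported for code built without metadata:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical decoded view of the metadata.
struct CoveredFunction {
  std::uintptr_t start; // function start address
  std::uint32_t size;   // function size in bytes
};

struct MetadataIndex {
  std::vector<std::uintptr_t> atomicPCs; // sorted ascending
  std::vector<CoveredFunction> covered;  // sorted by start, non-overlapping
};

enum class AccessKind { Atomic, Plain, Unknown };

AccessKind classify(const MetadataIndex &idx, std::uintptr_t pc) {
  assert(std::is_sorted(idx.atomicPCs.begin(), idx.atomicPCs.end()));

  // Find the last covered function whose start is <= pc.
  auto it = std::upper_bound(
      idx.covered.begin(), idx.covered.end(), pc,
      [](std::uintptr_t p, const CoveredFunction &f) { return p < f.start; });
  if (it == idx.covered.begin())
    return AccessKind::Unknown; // pc lies below every covered function
  --it;
  if (pc - it->start >= it->size)
    return AccessKind::Unknown; // pc lies in a gap: no metadata for this code

  // Inside analyzed code: the absence of an atomic entry now means "plain".
  if (std::binary_search(idx.atomicPCs.begin(), idx.atomicPCs.end(), pc))
    return AccessKind::Atomic;
  return AccessKind::Plain;
}
```

A detector built this way would only consider Plain accesses as race candidates and would ignore Unknown ones.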

The plan is to open source a stable, production-quality version of GWP-TSan and other GWP-Sanitizers. Their development, however, requires upstream compiler support. Until the first tool has been open sourced, we mark this kind of instrumentation as “experimental”, and reserve the option to change the binary format, remove features, and similar. Until that time, PC-keyed metadata via SanitizerBinaryMetadata can be emitted with the frontend flag -fexperimental-sanitize-metadata.

Implementation

Phabricator patch series:

Additionally a Git tree with the implementation is available here.

Enna1 · August 2, 2022, 1:16pm

I see the devmtg-2020 GWP-TSan talk says GWP-TSan is built on top of GWP-ASan.
Is the variant of GWP-TSan you mentioned still built on GWP-ASan? Is there any difference between this variant and the GWP-TSan presented at devmtg-2020?
Thanks!

melver · August 2, 2022, 2:13pm

It won’t be built on top of GWP-ASan. The current implementation will be based on watchpoints and a new Linux kernel API we introduced last year after learning that the GWP-ASan based variant simply isn’t the best design: Add support for synchronous signals on perf events [LWN.net] (in mainline since Linux 5.13)

The idea is still the same, but a few core details will be different. All GWP-TSan variants need the atomics metadata and the compiler support discussed in this RFC.

We’re still working on the GWP-TSan implementation which isn’t final yet, but to get there we’ll need the compiler support.

rnk · August 5, 2022, 6:11pm

Thanks for this proposal! The ability to mark instructions with labels and collect the PCs in some metadata section has been requested by researchers since I first began working on LLVM. You probably noticed that we already have some support for tracking labels on call instructions for the !heapallocsite metadata.

I suspect that, if and when this feature lands, researchers will immediately start using this feature regardless of any warnings you make about how IR transforms have not been updated to handle this metadata. Before we let that happen, I think it’s really important to nail down what we think the semantics of this metadata really should be, and what IR transforms should do if we were to update them to preserve the metadata.

So, that’s my main request: Please document the intended semantics. From your post, I get the idea that the metadata should be retained on an instruction when it is replaced with another, it should be retained when duplicated, and lost when the instruction is deleted. One cannot, for example, use this feature to hotpatch code in ways that change program semantics. You can patch the code, but it should be transparent to optimizers, like an instrumentation pass to count executed loads.

melver · August 8, 2022, 12:15pm

Thank you for taking a look! We are quite curious if you still have references or hints to some of the use cases that researchers had been interested in for such PC-keyed metadata.

Regarding documentation of semantics, I’ll revise the documentation and add you as a reviewer when I’ve amended the patch.

preames · August 8, 2022, 3:56pm

There’s a conceptual split here which is really important.

- Mandatory information. Dropping this information can affect the semantics of the program. As such, it must be preserved at the cost of optimization loss.
- Optional information. Dropping this information can be done without affecting the semantics of the program. Preserving it is strictly best effort.

Both have valid use cases, but knowing which you’re dealing with is critical.

There is significantly more prior art here than was mentioned. A couple in particular to be aware of:

- !loc as used by HHVM’s LLVM backend a few years ago. This seems the closest to what you’ve described, but note this implementation got the semantic split above wrong. It was used in a context where information was mandatory, but the implementation was best effort.
- deopt and gc operand bundles. This is an in-tree, supported, and mature implementation of mandatory (not optional) runtime metadata. Note that these will (by design) prevent optimizations if required to preserve information.
- !make.implicit and FaultMaps. This is an in-tree, supported, and mature implementation of a best effort semantic which late in the pipeline triggers a transform which then becomes mandatory.

If you are choosing a metadata based implementation, you are choosing the optional semantics. (This is fundamental to the choice of representation.) You may be perfectly fine with that, but you need to be aware of it, and you need to make sure the documentation clearly conveys that the information is optional and can be dropped without changing semantics.

melver · August 8, 2022, 4:43pm

This is a good point and we need to document this as well. We definitely do not want to affect optimizations or generated code outside PC sections, which is the reason for choosing the metadata-based implementation.

For our use cases, the information can be dropped, in which case the sanitizers wanting the information may report false positives or result in false negatives, which is annoying, but the program will continue functioning.

I will document that users of the metadata have to be tolerant to lost metadata, although every effort is taken to preserve it. Additionally, if there was a better way to detect if metadata was lost, we can fail more gracefully and/or fix metadata propagation.

preames · August 8, 2022, 6:03pm

“I will document that users of the metadata have to be tolerant to lost metadata, although every effort is taken to preserve it.”

Please do not use this exact wording. The whole point here is that we will not take “every effort” to preserve metadata. We will instead drop them wherever convenient.

“Additionally, if there was a better way to detect if metadata was lost, we can fail more gracefully and/or fix metadata propagation.”

Have been down this road before; could not make it work. You might have better luck, but I don’t advise this approach.

You know your use case better than I do, but I would not expect an optional semantics for a sanitizer. Having the tool report both false positives and false negatives due to unrelated compiler changes feels less than ideal.

davidxl · August 8, 2022, 7:47pm

AutoFDO’s pseudo-probe is another similar feature. @WenleiHe

melver · August 9, 2022, 1:09pm

Do you think there’s a difference between preserving metadata in IR-to-IR transformations, IR-to-MIR, and MIR-to-MIR transformations?

I think we’re on the same page with IR-to-IR transformations. But for IR-to-MIR and MIR-to-MIR transformations, I think LLVM could do better. The patches I posted show it’s not entirely unreasonable (see BuildMI() change). (We are primarily interested in attaching !pcsections after IR optimizations.)

preames · August 9, 2022, 3:37pm

See the example of implicit-null-checks I mentioned earlier. You can switch from the optional model to the mandatory model at some point in the pipeline, just be aware that’s what you’re doing. As a general rule of thumb, metadata is not mandatory at any stage. (I think? Feel free to correct me if this is wrong.)

So, yes, having a model where the semantics become mandatory before codegen is completely reasonable, but no, using metadata is probably not the right approach.

For clarity, a proposal which merges attributes and metadata, and explicitly introduces the semantic distinction between optional and mandatory kinds (which is largely implicit in the use of metadata vs attributes, and with some hard coded exceptions on attributes) would be a hugely useful cleanup. If you wanted to do that, then maybe we could use metadata for mandatory semantics in some cases after all.

melver · August 10, 2022, 8:56am

The general concept of metadata on LLVM IR Instructions is indeed “lossy”. However, MachineInstrs are different. They don’t have a notion of arbitrary metadata, and instead we need to add explicit storage for specific attributes in MachineInstrs (patch). What we end up doing with these new attributes on MachineInstrs is now up to us. What I’m proposing (through changes to instruction selectors) is to propagate these extra attributes on simple MIR transformations.

The guarantees we can then provide are:

- propagation through IR transformations is not guaranteed;
- if certain metadata kinds on IR instructions reach the instruction selectors, they will be preserved through simple MIR transformations (through usage of the new BuildMI());
- we will not affect generated code outside the “PC sections”, which is a top priority (e.g. implicit-null-checks switches to the mandatory model through a MIR transformation, which is not desirable for us).

Can you clarify what attributes and metadata you mean? Is this on (middleend) IR instructions?

On the whole, this discussion highlighted that we need to be careful about the guarantees we make, but also need to be clearer about how LLVM IR and MIR differ.

htyu · August 15, 2022, 10:06pm

“Most of the immediate use cases are to generate PC-keyed semantic metadata for sampling-based error detectors, a.k.a. sanitizers, that have zero overhead when disabled.”

This sounds interesting. I wonder if the PC-keyed metadata can be used to filter out unnecessary sampling at runtime. A bit more context: we are currently exploring sampling-based value profiling. We would like to restrict the sample traces to a few program points of interest (e.g., a particular callsite or expression) to avoid excessive sampling. To achieve that, we are looking at enhancing the profiler tool (such as Linux perf) with functionality that samples instructions only at particular addresses, based on an address section in the binary file. Do you see such a need on your end?

melver · August 16, 2022, 8:54pm

That’s an interesting use case. It’s not something we’ve considered right now, but the support for !pcsections metadata would allow for your use case as well.

One thing to keep in mind is that the addresses in the PC sections support we’re adding are encoded in a special way (see the documentation) to keep binary size manageable, but if the data is loaded by something like perf, then I don’t see an issue.

melver · August 17, 2022, 7:46pm

melver:

The guarantees we can then provide are:

- propagation through IR transformations is not guaranteed;
- if certain metadata kinds on IR instructions reach the instruction selectors, they will be preserved through simple MIR transformations (through usage of the new BuildMI());
- we will not affect generated code outside the “PC sections”, which is a top priority (e.g. implicit-null-checks switches to the mandatory model through a MIR transformation, which is not desirable for us).

preames:

For clarity, a proposal which merges attributes and metadata […]

Can you clarify what attributes and metadata you mean? Is this on (middleend) IR instructions?

On the whole, this discussion highlighted that we need to be careful about the guarantees we make, but also need to be clearer about how LLVM IR and MIR differ.

I have added a section on guarantees in D130875 [Metadata] Introduce MD_pcsections (see section “Guarantees on Code Generation” in documentation).

This should be in line with existing guarantees, where metadata remains optional in LLVM IR, but once we lower to MIR, select metadata becomes mandatory (bundled in MIMetadata). The new BuildMI() helper should in future, to make intent explicit, be changed to enforce passing MIMetadata bundles vs. accepting implicit DebugLoc/DILocation. I would prefer to make this transition incrementally, but if anyone has strong opinions on this, please do say.

melver · June 4, 2024, 3:41pm

For future reference: We recently open sourced the sampling-based sanitizer framework discussed here: GitHub - google/gwpsan: GWPSan: Sampling-Based Sanitizer Framework
