[RFC] Keep GlobalValue GUIDs stable

Problem

Several optimizations need a stable identifier for global values throughout compilation, thinlink, and recompilation. This notably includes PGO (instrumented and sampled). In addition, ThinLTO needs identifiers for indexing its summary, in both thin link and the ThinLTO backend.

Currently we use the GlobalValue GUID, but these are not stable. They’re a hash constructed from the name, linkage, and source filename, and these properties are not stable in all circumstances. In particular, ThinLTO function importing can change them. If module A defines an external function Foo which references an internal symbol Bar, then after importing Foo into module B we need to promote Bar to external so that it’s visible from B, and change its name to ensure uniqueness. Conversely, if ThinLTO finds a symbol with external linkage which doesn’t need to be external, it will change its linkage to internal.

This means that GUIDs are not guaranteed to be stable across different phases of the LTO / PGO pipeline. Without stable GUIDs, users have various workarounds. For instance, InstrProf stores its own metadata and uses that to construct a stable unique identifier, and for sample PGO we use -funique-internal-linkage-names to include a unique module ID in the name. This flag in particular is problematic as changing the names of internal symbols breaks assumptions (for instance, in the Linux kernel where assembly depends on symbol names).

These mechanisms are multiple, special-purpose, add a large space overhead, and do not work in all situations. Ideally we’d have a simple, small, unique identifier for any GlobalValue that we can use wherever necessary.

Proposed solution

The proposal is to make the GlobalValue GUID stable, and replace the special-purpose mechanisms with GUIDs. GUIDs are a good choice as they already exist for all GlobalValue objects, and are compact at only 8 bytes each.

To ensure the GUID remains stable, we will compute it once, up-front, in the module which defines the GlobalValue. The resulting value will be stored as metadata attached to the GlobalValue. This metadata will be used when constructing the module summary index. If a function is imported, its GUID metadata will be imported alongside. ThinLTO will use the pre-computed GUIDs when building its index, rather than computing them itself. Any changes to linkage at the function importing stage will have no effect on the GUID, as it is never recomputed. Any later accesses of the GUID will use the value stored in the metadata.

A section will be added to the binary which associates symbols with their GUID, to enable sample-based profiling to use the same GUIDs. This will cost 8 bytes per symbol (but note however that for sample PGO, removing the existing metadata / name mangling will save at least that much).

Plan

This is a somewhat wide-reaching change, as GUIDs are used in dozens of places across the codebase. We intend to roll this change out via the following steps:

Milestone 1: ThinLTO

As noted above, there are several users who require unique identifiers. The plan is to migrate ThinLTO first, as a proof of concept.

  1. Add a new LTO pre-link pass which computes a GUID for each GlobalValue and saves it as metadata in the module. The GUID computation will remain unchanged for values with external linkage, so existing users will get a matching GUID in this case (otherwise, they might not, as is the existing behaviour today).
  2. Change GlobalValue::getGUID to retrieve the value from the metadata (assert-failing if it’s not present).
  3. Use the metadata when constructing the module summary index, and in the LTO backend.
    • Note that this will require an extra side table in the bitcode. Function importing lazily loads functions, importing metadata only with the body. Before the body is imported, we won’t have the metadata and hence the GUID. This side table will provide the GUID immediately. This could be a new table, or an extra column in the existing value symbol table, details TBD.

Milestone 2: Other PGO users

With milestone 1 completed, we can be confident that this approach will work, and we can migrate other users piecemeal, as time permits. For each one we can remove its bespoke means of obtaining unique identifiers once they’re migrated. It is expected that not all users of GUIDs will need to be migrated to use this metadata, as the existing method will work unchanged for external symbols, and in many cases the assumption that the symbol is external will always hold.

9 Likes

In the absence of other more substantial feedback, I will break the silence and say, sounds great, ship it. =D

Yes, the Linux kernel does tend to insistently rely on internal compiler implementation details, but maybe the best thing to do is grumble about it, accommodate it, and move on with life, as you suggest.

I might be missing something here, but is 8 bytes enough to ensure uniqueness if you’re randomly generating the ID? Due to the birthday paradox, you have a significant chance of a collision once the symbol count gets into the millions.

| efriedma-quic
February 26 |

  • | - |

I might be missing something here, but is 8 bytes enough to ensure uniqueness if you’re randomly generating the ID? Due to the birthday paradox, you have a significant chance of a collision once the symbol count gets into the millions.

It isn’t, but we have designed things that already rely on the GUID (e.g. ThinLTO) to be conservatively safe in the face of conflicts. This proposal doesn’t change the reliance on GUIDs, it simply ensures they stay stable across the build process.
Teresa