[RFC] Safer Whole Program Class Hierarchy Analysis

Summary

Currently we cannot ensure safe whole program class hierarchy analysis in the presence of native objects/libraries during the (Thin)LTO link, for whole program devirtualization (WPD) and future related optimizations, as native objects do not always contain symbols identifying overridden base classes. To address this issue we propose adding the necessary information to a special section in the native object (ELF initially) that will be communicated from lld to LTO for analysis. Specifically, for WPD this would be the ThinLTO bitcode summary records identifying the type metadata and vtables in the native object. In the future, this can be extended with additional information as needed to support whole program analysis beyond WPD.

Background

Both ThinLTO and regular LTO optionally perform whole program devirtualization (WPD) based on class hierarchy analysis. Regular LTO uses the IR, whereas ThinLTO uses summary information, and an additional hybrid LTO mode (with split LTO modules) uses a combination.

It is only safe to perform WPD when we know that the LTO unit contains the full class hierarchy for the class type containing the virtual call target of an indirect call. Otherwise, we may incorrectly devirtualize a call that could actually target an unseen virtual function override.

In LLVM we employ several mechanisms to detect situations where the entire class hierarchy may not be visible during the whole program LTO link, and avoid devirtualizing in those cases, as described below.

Hidden visibility class types

Whole program devirtualization is generally enabled for types with hidden LTO visibility. By default this is just for those types that have hidden symbol visibility (e.g. those declared in anonymous namespaces). It is always safe to assume that these cannot be overridden outside the LTO unit.

This is also the case when compiling with -fvisibility=hidden, which asserts that all symbols should have hidden visibility. However, this is an assertion that must be made carefully: it is only valid when we know that the resulting IR object files will only be linked into a statically linked binary that does not link any shared libraries or other native objects that would need the symbols in the LTO unit or contain any overrides of its virtual functions.

Refining LTO visibility for statically linked binaries

For statically linked targets that do not link in shared libraries, other than e.g. system libraries that we know will not override user code types, it is generally safe to apply hidden visibility to all sources. However, in a build system where we may build multiple targets in a single build invocation, each C++ source file may feed multiple LTO links with varying visibility requirements.

To enable more aggressive use of hidden LTO visibility where appropriate, a mechanism was implemented to enable refinement of the LTO visibility on vtables at LTO link time under a new --lto-whole-program-visibility option. We use this mechanism to automatically enable WPD more aggressively for binaries linked statically (no shared libraries except for some system libraries).

By delaying the visibility refinement to hidden until the LTO link action, we can also take better advantage of build system caching, so that the bitcode objects generated by ThinLTO compiles can be shared by more LTO links across build invocations.

However, like -fvisibility=hidden, this is also an assertion, and must be applied only when appropriate.

Problem Description

Problems can arise if ThinLTO is disabled for any source files that may contain virtual function overrides. Occasionally it might be disabled for certain libraries or individual source files to work around a compiler or build system limitation, for example.

This can lead to incorrect devirtualizations if the --lto-whole-program-visibility LTO link option is applied, as shown in the following example:

base.h
class Base {
  virtual int foo();
};

base.cc
#include "base.h"
int Base::foo() { return 1; }
int bar(Base *B) { return B->foo(); }

derived.cc
#include "base.h"
class Derived : public Base {
  int foo() { return 2; }
};

int baz() {
  Derived d;
  return bar(&d);
}

main.cc
#include "base.h"
#include <stdio.h>
extern int baz();
int main() {
  Base b;
  printf("%d %d\n", bar(&b), baz());
}
# No LTO, correct output
$ clang++ *.cc -O
$ ./a.out
1 2

# ThinLTO+WPD all files, correct output
$ clang++ -c *.cc -flto=thin -fwhole-program-vtables -O
$ clang++ *.o -flto=thin -Wl,--lto-whole-program-visibility -fuse-ld=lld
$ ./a.out
1 2

# Recompile derived.cc without ThinLTO and link with ThinLTO objects
$ clang++ -c derived.cc -O
$ clang++ *.o -flto=thin -Wl,--lto-whole-program-visibility -fuse-ld=lld
# Incorrect output!
$ ./a.out
1 1

The reason for the incorrect output in the last case is that the thin link (the ThinLTO whole program step) has no way of knowing that class Base is overridden in derived.cc. Therefore WPD concludes there is a single implementation and incorrectly devirtualizes the call site in bar().

We cannot simply look at uses of base class vtables in native objects. The linker sees which symbols are referenced from native objects, and this information is already communicated to LTO, which needs to preserve any definitions for uses outside of the IR being LTO linked. Unfortunately, it is not required that the vtable symbol be referenced from another object in order for that type’s virtual functions to be overridden. In the above example, derived.o does not contain any reference to Base’s vtable.

Even if the base class vtable is not referenced by the native object, by default the typeinfo symbol for the base class will be. However, when compiling with -fno-rtti, typeinfo symbols are not created, so we cannot rely on looking for references to those symbols in native objects.

For example, when compiling derived.cc by default (-frtti), the vtable for Derived (_ZTV7Derived) contains a reference to its typeinfo object _ZTI7Derived which in turn contains a reference to the typeinfo object of Base _ZTI4Base:

@_ZTV7Derived = linkonce_odr dso_local unnamed_addr constant { [3 x ptr] } { [3 x ptr] [ptr null, ptr @_ZTI7Derived, ptr @_ZN7Derived3fooEv] }, comdat, align 8, !type !0, !type !1, !type !2, !type !3
@_ZTVN10__cxxabiv120__si_class_type_infoE = external global ptr
@_ZTS7Derived = linkonce_odr dso_local constant [9 x i8] c"7Derived\00", comdat, align 1
@_ZTI4Base = external constant ptr
@_ZTI7Derived = linkonce_odr dso_local constant { ptr, ptr, ptr } { ptr getelementptr inbounds (ptr, ptr @_ZTVN10__cxxabiv120__si_class_type_infoE, i64 2), ptr @_ZTS7Derived, ptr @_ZTI4Base }, comdat, align 8

When this is compiled to a native object, by default it therefore contains a reference to the typeinfo object of base class Base (_ZTI4Base):

$ nm derived.o | c++filt
…
U typeinfo for Base
…

However, if we add -fno-rtti, this reference disappears both in the vtable of Derived, which now has a null pointer in that slot as shown below, as well as in the resulting native object.

@_ZTV7Derived = linkonce_odr dso_local unnamed_addr constant { [3 x ptr] } { [3 x ptr] [ptr null, ptr null, ptr @_ZN7Derived3fooEv] }

Solution Alternatives

At a high level, we want a way to perform WPD, and any future whole program class hierarchy analysis optimizations, as broadly as possible with automatic detection and handling of cases where it cannot be applied safely.

Suppressing --lto-whole-program-visibility

Ideally we want to apply --lto-whole-program-visibility automatically in the build system for all statically linked targets, to get the widest benefit from WPD. One solution is to try to detect when we are linking with native objects that might have virtual function overrides and automatically prevent use of this option. However, automatically detecting and preventing use of this linker option when LTO is disabled for source files with potential virtual function overrides is difficult and suboptimal for a few reasons.

First, while the linker invoked during this action does know what native objects are being linked, as described above it is almost always the case that a link involves at least some native libraries/objects, since system libraries and the occasional assembly file compiled to a native object are linked in. So disabling whole program visibility when linking any native code would shut off the WPD refinement of LTO visibility described earlier for all binaries. We would need to maintain an allowlist of native objects that don’t require disabling WPD, which is a brittle and manual solution.

Additionally, even if we could automatically detect that LTO has been disabled for source files containing potential overrides, suppressing --lto-whole-program-visibility on all targets linking that source file would essentially completely disable WPD for all of those binaries (except on statically hidden visibility classes as described earlier), even though only a single virtual function may be impacted. This can result in unexpected or undesirable performance impacts.

Advantages:

  • If technically feasible, would prevent correctness issues by disallowing WPD (and any future optimizations) on binaries linking native code that breaks whole program visibility assumptions

Disadvantages:

  • Brittle as this would require some kind of allowlist for non-problematic native objects
  • Large indiscriminate hammer: disables most WPD for all dependent binaries, leading to unexpected/negative performance impacts

Speculative Devirtualization

One possibility is to perform all WPD as a speculative devirtualization (i.e. what we do for profile-based Indirect Call Promotion). Specifically, instead of transforming a virtual call unconditionally to the devirtualized direct call, turn it into a check of the virtual function pointer against the expected direct call target, falling back to the original indirect call in case of a mismatch.

Unfortunately, the performance overhead vs devirtualization without the speculative check was noticeable when this solution was tested.

Advantages:

  • Foolproof solution to this and any future analysis blind spots found
  • Avoids failures due to undefined behavior in code (incorrect casts) which are sometimes difficult to track down and fix

Disadvantages:

  • Performance impact
  • Affects every single devirtualization, as we cannot tell which might be impacted

Fatter Objects

In order to enable visibility into the native object for WPD and any future whole program optimizations, one possibility is to include additional information from the original IR into a special section in the native objects. For ThinLTO, which operates on the summary during the whole program step, this could include all or some of the summary bitcode. For regular LTO, which mostly operates on IR, this would need to include some or all of the IR, unless we change regular LTO whole program based transformations to utilize a summary.

We would need to teach the linker and LTO how to handle this bitcode. We would presumably want lld to do its normal symbol resolution based on the symbols in the ELF portion, but then still hand off the bitcode to LTO. And in LTO we would need to know to treat the bitcode (or its summary, in the case of ThinLTO) as read-only. Specifically, we cannot do any whole program optimizations that require changes (e.g. linkage types) on anything from the fat object’s bitcode. This is likely easier in ThinLTO where we are just dealing with summaries than in regular LTO, which operates directly on IR.

Additionally, for the saved bitcode (summary or IR) to be usable for WPD, we need to change clang to add type metadata to vtables even when not compiling for LTO units (which currently requires an -flto* option). It is possible that this change may require other modifications to avoid side-effects (although initial testing did not surface any issues).

The two extremes of this approach are to include either all of the module (IR for regular LTO / summary for ThinLTO), or to include only what we need for WPD class hierarchy analysis (the type metadata or its summary). These two are discussed in more detail below, but an intermediate option between these two extremes is also possible, and can be driven by whole program analyses needs. This will be touched on later.

Full Module Summary/IR Bitcode Section

At one extreme is a fat object including all of the summary for ThinLTO and all of the IR for regular LTO. The new bitcode section would need to be provided to LTO by the linker when these native objects are LTO linked, to allow visibility into that object’s class hierarchy (for WPD) and other information (for future whole program analyses). While LLVM does not currently have support for fat objects, it was recently proposed.

For handling the missing whole program visibility, creation of these fat objects would only be required for files being compiled to native objects without LTO, that might be linked with LTO objects. However, in many/most build systems it will not be possible to determine what type of objects a file being compiled will be linked with, which means we would likely need to conservatively always produce fat objects for all non-LTO compiles. This has an object file and linker input size cost.

Advantages:

  • Enables LTO to see the original IR (or just the summary for ThinLTO) in cases where LTO is disabled for some objects
  • Allows disabling WPD only when necessary: for overridden virtual functions (leaving it on for the rest of the binary)
  • Supports whole program analysis beyond just WPD
  • Fat objects can be emitted by other compilers (e.g. -ffat-lto-objects in gcc) and were recently proposed for Clang/LLVM

Disadvantages:

  • Expensive from an object file size and linker input standpoint due to limitations in detecting at compile time where the fat objects are required
  • Non-trivial changes to lld and LTO to treat the bitcode specially (albeit likely easier for ThinLTO where we are operating on summaries, as described earlier)

Type Metadata In ELF Section

A more limited version of the above idea is to simply include some additional class hierarchy information in a special section of the native ELF objects. This could be the bitcode summary for just the type metadata and vtable definitions, which would be sufficient for ThinLTO (and regular LTO WPD could also be taught to utilize this summary when available for native objects). The vtable type metadata will expose any overrides of virtual functions in the corresponding native objects.

This would be a much smaller set of additional information to place in the native objects, helping address overhead concerns described above for fat objects.

The linker would need to be taught to provide the bitcode summary from the special section of any native objects to the LTO handling. The LTO side changes, at least for ThinLTO, should be fairly minimal as the vtable type metadata summary information is read-only to ThinLTO WPD analysis, although as described later we may need to teach the other summary based analyses not to try to apply any optimizations to the corresponding vtable definition in the summary.

Similar to fat objects, we will need to ensure type metadata is applied to vtable definitions even for non-LTO compiles.

Advantages:

  • Enables LTO to see any virtual function overrides in native code where LTO was disabled
  • Allows disabling WPD only when necessary: for overridden virtual functions (leaving it on for the rest of the binary)
  • Less expensive to have always-on for source files being compiled without LTO
  • Less invasive changes required in LTO than for fat objects

Disadvantages:

  • Only (initially) addresses class hierarchy analysis related visibility issues (although this could be expanded over time, see below)

Intermediate Solution

We could also start with the minimal amount of summary required for solving WPD and other class hierarchy issues, and expand it as needed for any new whole program analyses in the future. Note that the vast majority of LTO whole program optimizations are symbol resolution driven, and we have all the information required to suitably drive aggressiveness based on the native object symbol tables and the resulting linker symbol resolution information.

Alternate Non-Bitcode Solution

A variation of this suggested by Peter Collingbourne is to emit the strings of all type ids from type metadata in that file into a new (non-bitcode) section. WPD would then need to treat all type tests using those type identifiers conservatively (block devirtualization). This makes it easier for other compilers to support emission of this section (e.g. when linking with native objects produced by a different toolchain), although we will likely need an escape hatch to handle that case anyway, since other compilers will not emit this section in the near term (i.e. the linker can communicate to LTO that it is linking a native object without the required section, and LTO can optionally disable all WPD conservatively). It also simplifies “ld -r” handling (which concatenates the contents of the same section from the merged objects). With “ld -r” the bitcode would also be concatenated into the merged object’s section, but some bitcode reader support might need to be added to split the pieces apart or interpret them correctly after merging.

Some of the advantages and disadvantages are the same as for including the type metadata related bitcode as described above, but here are the tradeoffs for listing type id strings in a string section vs type metadata related bitcode:

Advantages:

  • Easier for other compilers to generate this new section
  • Potentially more compact than type metadata summary
  • More natural handling with “ld -r”

Disadvantages:

  • Not (easily) extensible to include additional information needed by TBD future whole program optimizations.
  • Will not be able to determine which virtual functions within the corresponding vtable are overridden (need to conservatively block devirtualization for all virtual calls related to that type). The more fine grained handling could be achieved by including the offsets as well, which then becomes essentially the bitcode summary entries but in a different custom format.
  • Requires more custom WPD handling (vs the type metadata summary merged into the combined index which should fit seamlessly into the existing analysis), although this shouldn’t be too onerous.

Recommendation

In order to provide the most targeted but accurate fix for the visibility issue when disabling LTO for some source files, either fat objects or the more limited type metadata ELF section are ideal. To address overhead and complexity concerns around fat objects, a type metadata ELF section is likely to be the more attractive solution, unless fat objects are supported and enabled in the build for other needs. It is also possible to extend the new ELF section to include additional ThinLTO summary entries as we add new whole program analyses that need to see into the native object code. Using bitcode rather than a custom format is more LLVM-centric; however, it simplifies both extending the section to contain additional summary information for LLVM’s whole program analyses and using fat objects when they are enabled in the build.

For the above reasons, I recommend that we pursue the approach of adding type metadata ThinLTO summary bitcode into a special ELF section for non-ThinLTO compiles, and utilizing it when fed to an LTO link.

Design Overview

Clang changes to always add type metadata

As mentioned earlier, type metadata insertion currently requires building for LTO, which would need to be changed. A simple change to clang to suppress this requirement only resulted in a couple of LLVM test failures, all expected consequences of type metadata simply appearing where it hadn’t before.

Definition of new summary bitcode ELF section

LLVM supports wrapping bitcode in a native object file in the .llvmbc section. It’s possible that this section name could be utilized for the summary bitcode, rather than its current purpose of holding the full LLVM IR bitcode. However, this section is used for other specialized purposes, such as MLGO, which utilizes the bitcode embedded in this section under the -fembed-bitcode option for non-LTO compiles. Overloading this section to hold some summary bitcode instead of or in addition to full IR bitcode may confuse existing users of the section.

The recent -ffat-lto-objects proposal for LLVM appears to use a new .llvm.lto section in the WIP patch. If fat objects are enabled, we can simply utilize the information from that section, which will include IR and ThinLTO summary. If fat objects are not enabled, one option is to place the minimal summary information required for WPD in a section with the same name, which only requires the linker to look for and handle one special section, regardless of whether that includes partial summary, full summary, or full IR + summary.

Linker changes to pass new type summary section to LTO

Currently, lld and gold ignore the .llvmbc section when LTO linking. They would need to be taught to look at this section (or a new TBD section) in any linked native objects during LTO linking. Specifically, the linker will need to add the summary to the LTO unit via a likely custom version of LTO::add().

LTO changes to utilize vtable type summary

As described earlier, the vtable type metadata summary is read-only. The only effect it should have is to expand and complete the set of type descriptions compatible with a given type id, which is sufficient for appropriately guiding any WPD from virtual calls in IR modules. So in theory we should be able to simply add the summary entries into the module summary index and allow WPD to proceed as usual.

A complication is that we will need to include the vtable symbol definition summaries in the summary section. We will need to ensure that these are not used for optimizations like symbol importing or anything requiring a linkage type change, i.e. anything that relies on being able to modify the IR of the associated native object, since we do not have IR at that point. The GlobalValueSummary object already has a NotEligibleToImport flag which we can leverage for this purpose.


@davidxl @MaskRay @pcc who have seen a draft version of the proposal

@arda @petrhosek This is related to your fat LTO objects proposal

Tag @ormris as this looks related to [RFC] A Unified LTO Bitcode Frontend

Interesting, can you share how much of a performance overhead this led to? Recently I looked into sites that could be valuable for de-virtualization by examining the IPGO profile for single call-site ICP candidates. The gain would be entirely eliminating these checks but I wasn’t able to get accurate numbers because performance measurement required running the entire service. The entire service execution was not captured in the IPGO profile and thus meant some of these sites were not safe to “experimentally devirtualize”.

Interesting, can you share how much of a performance overhead this led to?

I saw for example a 0.6% increase in cpu time for an important internal application, which is a significant change. At that point we are wiping out any gains from WPD itself, at least on that application, and we essentially can do as well with just ICP alone.

Recently I looked into sites that could be valuable for de-virtualization by examining the IPGO profile for single call-site ICP candidates. The gain would be entirely eliminating these checks but I wasn’t able to get accurate numbers because performance measurement required running the entire service. The entire service execution was not captured in the IPGO profile and thus meant some of these sites were not safe to “experimentally devirtualize”.

Yeah, it will be tricky to measure this with profile alone. Have you tried WPD?

Good to know, that’s considerable for a 100% biased load+compare operation. That suggests that examining these hot single call-site ICP for opportunity to improve WPD and get rid of these checks could be valuable.

With -fwhole-program-vtables -Wl,--lto-whole-program-visibility we see a 0.2% performance improvement for HHVM (GitHub - facebook/hhvm: A virtual machine for executing programs written in Hack.).

It looks like part (most?) of the motivation behind this change is that unchecked assertion of --lto-whole-program-visibility is a ticking bomb in code that uses it. I think this solution is a good way of guaranteeing that WPD is stable on top of build-system or user changes. On a performance level, adding the rest of the summary also similarly stabilizes WPA and allows for more optimizations in the presence of native files.

If I understand correctly, the proposal is to add this section only to object files built for a ThinLTO build – but where ThinLTO is disabled for that particular file? If so, this seems like a very complex addition for a very narrow use-case.

Or, if that’s not correct – if the proposal is actually to emit this metadata into all native object files by default – I’d be quite concerned about that. This seems like a very odd halfway between fat objects and native objects, and I don’t think that’s a good idea. Especially if the format is bitcode – I really don’t think we should be putting bitcode sections into native object files by default. Native object files are a standard format, while bitcode is llvm-only and also has additional cross-version compatibility constraints.

Perhaps as an alternative it would be viable to forbid disabling ThinLTO for individual object files within a ThinLTO --lto-whole-program-visibility build (via throwing an error in the buildsystem, e.g.)? Without knowing exactly the cause of people disabling ThinLTO, it’s hard to say, but ISTM that we ought to be able to come up with a better workaround users can use, which still emits ThinLTO bitcode for the compile, yet allows the user to workaround some problem. (Might be able to forbid cross-TU function imports involving that file, e.g., if that’s a problem.)

I guess in your case you actually have class defined from native libs that is derived from the LTO part, and you want to be able to capitalize on WPD still without asking those native libs to participate in LTO?

The scale of the proposed change seems a bit disproportional to the potential performance headroom here (FWIW, from three of our top workloads, there’s 0.2% from WPD from one, and not measurable from the other two). But opening up metadata/summary in native object seems like a big hammer.

It surely makes things convenient from correctness POV, but we could also argue that if the performance potential is big enough, users should be motivated to have the native part properly participate in LTO? The other thing is that, even though the proposed scheme could be viewed as general approach to open up more LTO optimization in the presence of native lib, there’s perhaps not that much to be gained outside of WPD.

Hi all, thanks for the input. Consolidating my replies to the last few posts since they cover similar questions.

Agreed. It is a solution to make WPD more widely available and robust, and should be extensible for future WPA needs.

I’m not sure it is particularly complex, and the goal is to enable a solution that can be extended in the future to additional WPA for which the basic symbol table info is not sufficient. I can probably prototype something and post some WIP patches to show what would be involved.

We could go full fat bitcode as has been proposed for other needs, but that IMO is overkill for what is needed here and for many future analyses. I don’t think it should be emitted in all native objects by default, although how frequently it needs to be enabled probably depends on build system needs.

Native objects are a standard format, but allow for custom section types (such as for fat objects). One option as mentioned in the proposal is to use a different custom format instead of bitcode, but I’m not convinced that is easier or better overall.

I considered that but it seems unworkable to me because ThinLTO may need to be disabled at the file level for any number of reasons (working around compiler or build system issues). For example if you consider a build system like Bazel, options can be added to BUILD files for individual libraries/source code, which in a monorepo like that used in Google can affect numerous build targets. Completely different projects may be trying to build with ThinLTO. If one dependent source file finds a need to disable ThinLTO for any reason then you can cause multiple build failures with this approach, leading to debugging time across multiple projects and workarounds needed in various build scripts.

Even if we could find a way to automatically just drop --lto-whole-program-visibility in that case, again this means that a workaround in a single file can affect the performance of entire binaries owned by many projects. The goal of this proposal is to avoid these situations, and merely disable affected transformations automatically.

We discourage disabling ThinLTO at the file/library level, and in many cases disabling one specific optimization or another suffices to workaround e.g. compiler bugs. But there are occasional problems that don’t have a good alternate solution.

Correct, to allow WPD to proceed safely on the rest of the binary.

We have seen a range of performance effects from WPD, many are small when instrumentation PGO is used since ICP gets a lot of the benefit. But we have one example that got 7%. And I anticipate more if we can safely roll it out to a longer tail of applications that aren’t so heavily tuned and don’t use application-specific profiles.

Also, the current WPD is quite restrictive. It only performs single implementation devirtualizations, and cannot currently take advantage of the new context after inlining (which can refine the type to one that has a single implementation). We have a list of improvements and other uses for WP class hierarchy analysis. But we need to ensure full safety before adding aggressiveness.

See my response above to @jyknight on the first part. Unfortunately, there are probably always going to be cases that show up from time to time where LTO (thin or regular) needs to be disabled for a file or library. At the worst case this causes a correctness issue. But even if we can detect it we are currently then faced with the option of completely disabling a whole program optimization for all dependent binaries.

Beyond any class hierarchy analysis based optimization (of which WPD is one), I could see this being useful in cases where we want to enable more aggressive whole program optimizations that are currently completely disabled for any symbol that shows up in native objects.

Thanks for clarification. My understanding of the two motivations behind the proposal:

  1. Make it possible to disable LTO on single file as workaround for bugs without losing performance from WPD.
  2. Lower the entry barrier (in terms of visibility and presence of native obj) for some WPO – currently WPD.

For #1, this still feels like a big hammer just to enable users to apply workarounds. But from the perspective of #2, if performance from WPD is meaningful, and it has the potential to deliver more, perhaps it would justify the need to lower the entry barrier and the proposed changes.

I also agree that current WPD is quite limited – single implementation or virtual constant prop are all quite narrow. I was about to suggest that we improve WPD first, by adding inter-procedural type propagation and leveraging inline context as you mentioned for proper devirt to deliver more perf before worrying about its usability. But if improvements are planned, the order of two aren’t necessarily critical.

For the actual change, if fat objects proposal gets implemented, I’m wondering if this can be folded into fat objects. Sure only summary is needed here, but if this is only meant to be used occasionally and temporarily, the benefit of one less variant is a consideration. If we aren’t going to have fat objects, then what you proposed makes sense to me. I don’t have a strong opinion on this though.

But we have one example that got 7%

Is this on top of PGO ICP for large workloads? It’s surprising to see speculative checks alone costing 7% overall perf for large workloads.

Reviving since I’m curious what the current state of the proposal/solutions are since it doesn’t seem like there’s clear consensus.

Additionally, on our (Meta) side we’ve been looking further into whole program devirtualization and found the initial set of measurements did not have the feature properly enabled. Re-evaluating, there’s a consistent ~0.5% win on some of our core services.

However, we’re seeing a similar issue where native binaries participating in the final link cause safety concerns. In addition, most of the native binaries are not built with Clang, and I suspect we’re not the only ones in such a scenario. Given that, has there been thought given to analyzing what types can cross between the IR and native boundary that can make WPD unsafe?

For instance in the example given, we’re getting a Base *B from a native call which could be of a type we don’t know about. Detecting all such sites seems a solvable problem. In addition, doing so also enables a mode where LTO visibility is auto-defined based on the link inputs which could be a nice boost for greater adoption.

modimo (June 5):

Reviving since I’m curious what the current state of the proposal/solutions are since it doesn’t seem like there’s clear consensus.

I unfortunately have been too swamped with other priorities to work on this, but would really like to get back to it as we need to address the issue to deploy WPD widely.

Additionally, on our (Meta) side we’ve been looking further into whole program devirtualization and found the initial set of measurements did not have the feature properly enabled. Re-evaluating, there’s a consistent ~0.5% win on some of our core services.

This is great to hear!

However, we’re seeing a similar issue where native binaries participating in the final link cause safety concerns. In addition, most of the native binaries are not built with Clang, and I suspect we’re not the only ones in such a scenario. Given that, has there been thought given to analyzing what types can cross between the IR and native boundary that can make WPD unsafe?

For instance in the example given, we’re getting a Base *B from a native call which could be of a type we don’t know about. Detecting all such sites seems a solvable problem. In addition, doing so also enables a mode where LTO visibility is auto-defined based on the link inputs which could be a nice boost for greater adoption.

Can you give a more concrete example of the approach you are suggesting? As mentioned in the RFC, there are cases where the native object contains a derived class that has no reference to the base class whatsoever. So we cannot rely on normal linker symbol analysis. There are certainly some cases we could detect, but it would not be something we can rely on.

Teresa

Can you give a more concrete example of the approach you are suggesting? As mentioned in the RFC, there are cases where the native object contains a derived class that has no reference to the base class whatsoever. So we cannot rely on normal linker symbol analysis. There are certainly some cases we could detect, but it would not be something we can rely on.

What we’ve been thinking about would be more like a data flow analysis, instead of pure type hierarchy analysis based on vtable references. The main motivation is to avoid requiring native prebuilds to be built with LLVM toolchain, but still have a way to meet the safety requirement for WPD.

If we can identify all unsafe types that can potentially flow through the boundaries between the LTO unit and native prebuilds, we can then exclude all types in an unsafe type’s hierarchy, and still perform WPD for the rest safely.

If, in the LTO unit, we have a call to B* foo() defined in a native prebuild, then any type in B’s hierarchy should be marked unsafe, because foo could return a B that is actually an unseen (in the LTO unit) derived type. Similarly, if we have void bar(A*) defined in the LTO unit but called from a native prebuild, A should also be marked unsafe, and types in A’s hierarchy should be excluded from WPD.
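To make the hazard concrete, here is a hypothetical sketch of the scenario described above. The names (Base, NativeDerived, foo) are invented for illustration; NativeDerived stands in for a derived class that, in the real scenario, would live only in the native prebuild and be invisible to the LTO unit.

```cpp
// Visible in the LTO unit; the whole-program analysis would like to
// assume it has seen every override of run().
struct Base {
  virtual ~Base() {}
  virtual int run() { return 0; }
};

// Stands in for a derived type defined only in a native prebuild,
// which the LTO unit never sees.
struct NativeDerived : Base {
  int run() override { return 42; }
};

// Stands in for a call into the native prebuild: the LTO unit sees
// only the declaration "Base *foo();" and cannot know the dynamic
// type of the result. Devirtualizing calls through the returned
// pointer to Base::run() would therefore be unsound.
Base *foo() { return new NativeDerived(); }
```

This is why any type in Base's hierarchy must be marked unsafe once a Base* can cross the boundary in either direction.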

So the analysis would come down to: 1) identifying the boundary between the LTO unit and the native prebuilds; 2) identifying the types that can flow across that boundary. Note that if a type leaks into the native prebuilds, the member functions of the leaked type become newly discovered boundary, which can then cause new types to be identified as crossing the boundary, so it can be an iterative process until things converge. With LTO hidden visibility, the starting points for the boundary are mostly calls from the LTO unit into prebuilds, so hopefully when the analysis converges, the remaining types should still be very meaningful for WPD to reap benefit.
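The iterative propagation described above is essentially a worklist fixpoint. A minimal sketch, with hypothetical data structures (a real implementation would work over ThinLTO summaries or IR rather than string type names):

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each type to the types reachable from it once it leaks across
// the boundary: base/derived classes, pointer member classes, and the
// parameter/return types of its member functions (which become newly
// discovered boundary). Hypothetical representation for this sketch.
using TypeGraph = std::map<std::string, std::vector<std::string>>;

// Starting from the types that appear directly in calls between the
// LTO unit and native prebuilds, propagate "unsafe" until convergence.
std::set<std::string>
computeUnsafeTypes(const TypeGraph &Reaches,
                   const std::vector<std::string> &BoundarySeeds) {
  std::set<std::string> Unsafe;
  std::vector<std::string> Worklist(BoundarySeeds.begin(),
                                    BoundarySeeds.end());
  while (!Worklist.empty()) {
    std::string T = Worklist.back();
    Worklist.pop_back();
    if (!Unsafe.insert(T).second)
      continue; // already marked unsafe
    auto It = Reaches.find(T);
    if (It == Reaches.end())
      continue;
    // Every type reachable from a leaked type is itself unsafe.
    for (const std::string &Next : It->second)
      Worklist.push_back(Next);
  }
  return Unsafe;
}
```

WPD would then be restricted to vtables whose class hierarchies contain no type in the resulting unsafe set.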

There are more details that will need to be taken care of, specifically around type casts, but maybe we can piggyback on strict aliasing to handle them reasonably.

Thanks for the details. In theory something like this could work, if we had the necessary type info, but might end up being very conservative.

But have you thought about what information would be needed in the summaries to do this with ThinLTO? That seems like we would need quite a lot of type info. And even in the IR we don’t have pointer parameter types with opaque pointers.

If in the LTO unit, we have a call B* foo() to a native prebuild, then any type in B’s hierarchy should be marked unsafe, because foo could return a B that is an unseen (in LTO) derived type.

I think it isn’t just types in B’s class hierarchy that would become unsafe, but also types in the class hierarchies of any pointer members that class B contains, and so on (not sure if you were including that in what you describe above).

but might end up being very conservative.

I think it comes down to how large the boundary surface is, and how disjoint the types are between the LTO and native parts. In cases where the native parts are mostly low level libraries (like ours), this is very likely enough for us to prove that types in core business logic are safe for devirt. We also need the strict-aliasing rule to handle casts so it won’t be too conservative.

But have you thought about what information would be needed in the summaries to do this with ThinLTO? That seems like we would need quite a lot of type info. And even in the IR we don’t have pointer parameter types with opaque pointers.

Yes, parameter type info needs to be kept, and it could be non-trivial for ThinLTO.

I think it isn’t just types in B’s class hierarchy that would become unsafe, but also types in the class hierarchy of any pointer members class B contains, and so on (not sure if you were including that in what you describe above).

You’re right. Though this goes back to questions of how disjoint types are between LTO and native parts.

Even if the base class vtable is not referenced by the native object, by default the typeinfo symbol for the base class will be. However, when compiling with -fno-rtti, typeinfo symbols are not created, so we cannot rely on looking for references to those symbols in native objects.

Another idea we’ve been thinking about, prompted by the above comment in your RFC: maybe it’s more reasonable to require RTTI for safety. Requiring type metadata (the recommended approach in the RFC) means native libraries all need to be built by LLVM, which is a strong requirement and likely a deal breaker for many. But if we require RTTI instead, it would not be dependent on a specific compiler.

We should be able to detect when RTTI isn’t on for everything coming from native prebuilds, i.e. if not all vtables have corresponding RTTI, and bail or error out in that case. WDYT?
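Such a check could plausibly key off symbol names: under the Itanium C++ ABI, a class's vtable symbol is prefixed _ZTV and its typeinfo symbol _ZTI, with the same mangled class name after the prefix. A hedged sketch of the detection, over a hypothetical list of symbols defined by the native objects:

```cpp
#include <set>
#include <string>
#include <vector>

// Returns true if every vtable symbol (_ZTV<name>) in the native
// objects has a matching typeinfo symbol (_ZTI<name>). A vtable with
// no typeinfo suggests the object was built with -fno-rtti, in which
// case RTTI-based hierarchy analysis should bail or error out.
bool allVTablesHaveRTTI(const std::vector<std::string> &DefinedSymbols) {
  std::set<std::string> TypeInfos;
  for (const std::string &S : DefinedSymbols)
    if (S.rfind("_ZTI", 0) == 0)
      TypeInfos.insert(S.substr(4)); // mangled class name
  for (const std::string &S : DefinedSymbols)
    if (S.rfind("_ZTV", 0) == 0 && !TypeInfos.count(S.substr(4)))
      return false; // vtable without typeinfo: RTTI likely disabled
  return true;
}
```

A real implementation would also need to account for corner cases such as vague-linkage vtables and typeinfo emitted in different objects, so treat this only as the shape of the check.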

One way or another I think it’s important to not require native libraries to be built by LLVM in order to do safe devirt.

We should be able to detect when RTTI isn’t on for everything coming from native prebuilds, i.e. if not all vtables have corresponding RTTI, and bail or error out in that case. WDYT?

That could work if it can be reliably detected and doesn’t conservatively affect too many targets, which might be a good approach for your situation.

One way or another I think it’s important to not require native libraries to be built by LLVM in order to do safe devirt.

We might not end up with a one size fits all solution, unfortunately. I can rely more on everything being built with LLVM than I can on -frtti being enabled for everything. In our case we usually end up in the unsafe WPD situation while compiling with LLVM but needing to disable ThinLTO to deal with corner cases in the build system. We don’t tend to link with precompiled third party libraries other than maybe a handful (and shrinking) of low level libraries.

Since I haven’t had much opportunity to work on the approach described here, I’m considering whether we can leverage the Fat LTO support being completed in D146776 to get the needed information from the embedded summary bitcode for the files, while still linking with the native part.

We might not end up with a one size fits all solution, unfortunately. I can rely more on everything being built with LLVM than I can on -frtti being enabled for everything.

RTTI would only need to be enabled for the non-ThinLTO built binaries to act as a record for the unsafe types. That seems like something that could reasonably be done if the main scenario is ThinLTO being disabled on a handful of targets.

RTTI as the source of truth for WPD safety is quite appealing since:

  1. It’s default-on for both Clang and GCC, which means there’s a high chance it’s already present in pre-built binaries
  2. It’s an already existing and well-known data structure for all related tooling

If it’s doable to use RTTI to enumerate the unsafe types in native files on your side, we could have a single unified solution, which would be sweet.

Since I haven’t had much opportunity to work on the approach described here, I’m considering whether we can leverage the Fat LTO support being completed in D146776 to get the needed information from the embedded summary bitcode for the files, while still linking with the native part.

Could you elaborate on this idea? This is a use-case I hadn’t considered at all for FatLTO, so I’m quite interested to understand how you envision that it could help w/ WPD.

Could you elaborate on this idea? This is a use-case I hadn’t considered at all for FatLTO, so I’m quite interested to understand how you envision that it could help w/ WPD.

Sure! This is described in some more detail in the Fatter Objects section of the RFC, specifically in the
Full Module Summary/IR Bitcode Section subsection.

Essentially, the idea in my proposal was to allow native objects to contain a section with some amount of ThinLTO summary info, specifically the type metadata summary, which is needed for accurate WPD. At one extreme is including the full IR + summary, which is essentially FatLTO. That is overkill for what we need for WPD, but since your patches are about to go in, it gives an essentially ready-made solution for having this info alongside native objects.

However, we still need some changes, which is to teach lld to provide the bitcode (with summary) to LTO, but with a flag indicating that it is only to be used for analysis (because lld is linking the native part), then some changes to LTO to only include the type metadata info in the combined summary index used for ThinLTO analysis under that flag. Finally, we would need to enable the generation of FatLTO objects for sources that are not being compiled with ThinLTO, in any build with ThinLTO enabled.

Of course, as noted by @WenleiHe and @modimo, this won’t work for precompiled native objects that in particular may be built by other compiler toolchains. However, it does work to address issues where ThinLTO has been disabled at the file/library level to work around various corner cases in the build system or bugs etc.

Ideally we could support both solutions: use RTTI when available in the native object, and FatLTO objects for any files where we are disabling ThinLTO but still compiling with LLVM (and further, only when the file is compiled with -fno-rtti). There’s still a hole if you are linking with native objects that may not have been built with LLVM+FatLTO and are compiled with -fno-rtti, but there’s probably no way to handle that situation, and ultimately --lto-whole-program-visibility is an assertion from the build system that LTO has whole program visibility. The goal is to maximize the cases where that assertion can be safely applied by a build system so that optimizations like WPD are usable.