A sufficient test for GV unification via unnamed_addr?

Does anyone know of a good test to estimate whether or not a pair of GlobalVariables could potentially be unified due to the “unnamed_addr” flag? This is for an out-of-source AA project, and I need to err on the side of assuming unification is possible. But for precision reasons, I’d really like to avoid false positives.

I’m having trouble understanding just how equivalent two GVs’ initializers must be for the linker to be allowed to unify them. Here’s what the docs say:

Global variables can be marked with unnamed_addr which indicates that the address is not significant, only the content. Constants marked like this can be merged with other constants if they have the same initializer. Note that a constant with significant address can be merged with a unnamed_addr constant, the result being a constant whose address is significant.

That wording doesn’t tell me how similar two GVs’ initializers must be before the linker is free to unify the GVs’ storage. I’ve got a few theories, but would appreciate any suggestions. I’m hoping for an overall test that is both precise and not too computationally expensive on a program with very many globals.

  • Theory 1: At the LLVM C++ API level, GV1 and GV2 can only be unified if their initializers are the very same API object, i.e., “GV1->getInitializer() == GV2->getInitializer()”. For this to be a sufficient test, I think the C++ API implementation would need to make some strong promises about using a single object to represent equal or equivalent initial values.

  • Theory 2: The linker requires that the initializers for GV1 and GV2 are syntactically equivalent compile-time constants, but their initializers might not be described using the same llvm::Constant object. For example:

@GV1 = private unnamed_addr constant [4 x i8] c"Foo\00", align 1

@GV2 = private constant [4 x i8] c"Foo\00", align 1

  • Theory 3: Type-safe compile-time-constant semantic equivalence which, unlike Theory 2, allows for syntactically alternative representations. All that matters is that two initializer objects are equivalently typed, equivalently shaped, and ultimately have equivalent constituent scalar values. For example:

@GV1 = unnamed_addr constant [4 x i32] zeroinitializer, align 16
@GV2 = constant [4 x i32] [i32 0, i32 0, i32 0, i32 0], align 16

  • Theory 4: Arbitrary compile-time-constant bit-pattern equivalence. For example:

@X = constant i32 -1, align 4
@Y = unnamed_addr constant [4 x i8] c"\FF\FF\FF\FF", align 1
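
To make the difference between Theories 3 and 4 concrete, here is a minimal sketch in plain C++ that compares initializers by their byte image. It does not use the LLVM API at all: the helper names (bytesOfI32, bytesOfI32Array, bytesOfZeroInit, sameByteImage) are hypothetical, and it assumes a little-endian layout with no padding, as on x86-64. Under Theory 4 only the bytes matter; Theory 3 would additionally require the two types to match, which this sketch deliberately ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: byte image of an i32, assuming little-endian layout.
Bytes bytesOfI32(std::int32_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    Bytes out;
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFFu));
    return out;
}

// Hypothetical helper: byte image of an [N x i32] array, assuming no padding.
Bytes bytesOfI32Array(const std::vector<std::int32_t> &elems) {
    Bytes out;
    for (std::int32_t e : elems) {
        const Bytes b = bytesOfI32(e);
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}

// Hypothetical helper: byte image of a zeroinitializer of the given size.
Bytes bytesOfZeroInit(std::size_t sizeInBytes) {
    return Bytes(sizeInBytes, 0);
}

// Theory 4 as a predicate: same length and same bytes, types ignored.
bool sameByteImage(const Bytes &a, const Bytes &b) {
    return a == b;
}
```

With these helpers, the image of @X above, bytesOfI32(-1), can be compared against the four 0xFF bytes of @Y, and bytesOfZeroInit(16) against bytesOfI32Array({0, 0, 0, 0}) for the Theory 3 example.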

Note: I’m using LLVM 3.7’s C++ API. The target program will ultimately be linked on a modern x86-64 Linux system, probably using GNU ld. The program is compiled with clang or clang++, and in some cases I’ve used “llvm-link” to combine the target-program bitcode files into a single module. My analysis only considers a single bitcode file in isolation.
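
For the cost concern with very many globals, one possible shape for the overall test is sketched below, again in plain C++ rather than against the LLVM API. ModelGlobal and candidatePairs are hypothetical names, and the sketch assumes the byte image of each initializer has already been computed. It uses the loosest of the four theories (equal byte images) so that it errs toward “may be unified”, at some cost in precision if a stricter theory turns out to be the right one. Globals are bucketed by byte image first, so only globals with identical contents are ever compared pairwise. Following the quoted docs, a pair is reported only if both globals are constants and at least one is unnamed_addr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the few GlobalVariable properties the test needs.
struct ModelGlobal {
    std::string name;
    bool isConstant;                      // declared "constant" rather than "global"
    bool unnamedAddr;                     // carries the unnamed_addr flag
    std::vector<std::uint8_t> initBytes;  // precomputed byte image of the initializer
};

// Returns index pairs (i, j), i < j, of globals that might be unified.
std::vector<std::pair<std::size_t, std::size_t>>
candidatePairs(const std::vector<ModelGlobal> &globals) {
    // Bucket constant globals by byte image; non-constants are never merged.
    std::map<std::vector<std::uint8_t>, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < globals.size(); ++i)
        if (globals[i].isConstant)
            buckets[globals[i].initBytes].push_back(i);

    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (const auto &entry : buckets) {
        const std::vector<std::size_t> &idx = entry.second;
        for (std::size_t x = 0; x < idx.size(); ++x)
            for (std::size_t y = x + 1; y < idx.size(); ++y)
                // At least one side must have an insignificant address.
                if (globals[idx[x]].unnamedAddr || globals[idx[y]].unnamedAddr)
                    result.emplace_back(idx[x], idx[y]);
    }
    return result;
}
```

Within a bucket the comparison is still quadratic, so a program with thousands of identical constants would still be expensive; tracking per bucket only whether any member is unnamed_addr would avoid that if the analysis does not need explicit pairs.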