> Hi! A few of us over at Google think a nice feature in clang would be
ODR violation checking, and we thought for a while about how to do this and
wrote it down, but we aren't actively working on it at the moment nor plan
to in the near future. I'm posting this to share our design and hopefully
save anyone else the design work if they're interested in it.
> For some background, C++'s ODR rule roughly means that two definitions
of the same symbol must come from "the same tokens with the same
interpretation". Given the same token stream, the interpretation can be
different due to different name lookup results, or different types through
typedefs or using declarations, or due to a different point of
instantiation in two translation units.
> Unlike existing approaches (the ODR checker in the gold linker for
example), clang lets us do this with no false positives and very few false
negatives. The basis of the idea is that we produce a hash of all the
ODR-relevant pieces, and to try to pick the largest possible granularity.
By granularity I mean that we would hash the entire definition of a class
including all methods defined lexically inline and emit a single value for
> The first step is to build a new visitor over the clang AST that
calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler
doesn’t work here because it includes pointers addresses which will be
different across different translation units.) Hash the outermost
declaration with external-linkage. For example, given a class with a method
defined inline, we start the visitor at the class, not at the method. The
entirety of the class must be ODR-equivalent across two translation units,
including any inline methods.
> Although the standard mentions that the tokens must be the same, we do
not actually include the tokens in the hash. The structure of the AST
includes everything about the code which is semantically relevant. Any
false positives that would be fixed by hashing the tokens either do not
impact the behaviour of the program or could be fixed by hashing more of
the AST. References to globals should be hashed by name, but references to
locals should be hashed by an ordinal number.
> Instantiated templates are also visited by the hashing visitor. If we
did not, we would have false negatives where the code is not conforming due
to different points of instantiation in two translation units. We can skip
uninstantiated templates since they don’t affect the behaviour of the
program, and we need to visit the instantiations regardless.
> In LLVM IR, create a new named metadata node !llvm.odr_checking which
contains a list of <mangled name, hash value> pairs. The names do not
necessarily correspond to symbols, for instance, a class will have a hash
value but does not have a corresponding symbol. For ease of implementation,
names should be mangled per the C++ Itanium ABI (demanglable with c++filt
-t). Merging modules that contain these will need to do ODR checking as
part of that link, and the resulting module will have the union of these
> In the .o file, emit a sorted table of <mangled name, hash value> in a
non-loadable section intended to be read by the linker. All entries in the
table must be checked if any symbol from this .o file is involved in the
link (note that there is no mapping from symbol to odr table name). If two
.o files contain different hash values for the same name, we have detected
an ODR violation and issue a diagnostic.
> Finally, teach the loader (RuntimeDyld) to do verification and catch
ODR violations when dlopen'ing a shared library.
This is the right basic design, but I'm curious why you're suggesting
that the payload should just be a hash instead of an arbitrary string.
What are you suggesting goes into this string?
The same sorts of things that you were planning on hashing, but maybe not
hashed. It's up to you; having a full string would let you actually show a
useful error message, but it definitely inflates binary sizes. If you
really think you can make this performant enough to do on every load, I can
see how the latter would be important.
I was thinking we could add more things to help diagnostics, next to the
hash. I *think* there are two cases that matter, but there may be more.
Either we have an ODR violation where the file:line are different, or if
file and line are the same then the preprocessor state was different. We
could emit file and line from the starting loc of each of the hashes, and
we could emit a preprocessor table with the list of initial defines and
changes to those defines as the TU went along -- at each hash we could
point to an index into that table to indicate where we are. Both of those
give us enough information for the linker to say why the ODRs failed to
Do you have any idea how much data that is? Your .o files are going to be
The point of the design is to minimize the number of entries, so adding
more data per-entry doesn't seem like a big problem.
That said, I haven't implemented even an experiment that counts how many
entries we would produce. I can do that and come back with numbers,
possibly with my tail between my legs.
There are other situations where the file, line and preprocessor
settings were the same. Then I'd expect the md5 of the file was different,
and the file was changed between two builds. Anything beyond that is pretty
If the diagnostics you want the linker to report are literally as coarse
as “this file changed, maybe that has something to do with it?”, I think
you might be able to find something that doesn’t require quite so much of a
This isn't going to be performant enough to do unconditionally at every
load no matter how much you shrink it.
Every load of a shared object? That's not a fast operation even without
odr checking, but the idea is to keep the total number of entries in the
odr table small. It's less than the number of symbols, closer to the number
of top-level decls.
Your ABI dependencies are every declaration *that you ever rely on*.
You've got to figure that that's going to be very large. For a library of
any significance, I'd be expecting this check to touch about half a
megabyte of data, even with a 32-bit hash and some sort of clever prefixing
scheme on the symbols. That's a pretty major regression in library loading.
Fair point. If we want to include less in the hash just to make it more
palatable for dynamic library users, we can have a flag for that. Some sort
of ODR-checking lite.
I really don't care about .so files myself.
Maybe you should just spec this out as a static-linker-only technology,
That's fair criticism. I did exactly that, then was asked to please
consider the dynamic linker case and lazily said "oh sure, I don't see any
reason the same scheme can't verify at load time too". I'll retract any
claims about handling the dynamic loading case.
Also, you should have something analogous to symbol visibility as a way
to tell the static linker that something only needs to be ODR-checked
within a linkage unit. It would be informed by actual symbol visibility,
Great point, and that needs to flow into the .o files as well. If a class
has one visibility and its method has another, we want to skip the method
when hashing the class, and need to emit an additional entry for the method
alone? Is that right?
Class hashes should probably only include virtual methods anyway, but
yes, I think this is a good starting point.
What do you want in the hash for a function anyway? Almost everything is
already captured by (1) the separate hashes for the nominal types mentioned
and (2) the symbol mangling. You're pretty much only missing the return
type. Oh, I guess you need the body's dependencies for inline functions.
On the contrary, this is the case I care about! Two different definitions
of the same function.
Allow me to suggest a structural hash of the statements, then, instead of
dumping the entire preprocessor history.
But I am suggesting a structural hash of the statements!
Consider a system that produces a structural hash of the statements. We
should do this with something that behaves similarly to the StmtProfiler,
but with specialized handling for things like labels and global variables.
Suppose we get a hash for each function, and emit a table of hashed
function name to hashed AST. Great! The basis of an ODR checker.
But can we do better? Instead of storing a hash for each function, we can
group them (such that we produce a single hash for more than one function),
for instance by class they're in when the methods were written in the body?
What I proposed is to have a single has for each top-level-declaration and
all of its lexically-included contents, dramatically reducing the number of
hashes we need to store. The tradeoff is that we can't quickly map from
symbol-name to an entry in the ODR table. That in turn means need to check
all ODR entries in a .o if we bring in any symbol from that .o file (we
can't just check the individual function because it may not have been
You were asking why we should hash things when we could just store the
source code. I don't see how storing the text of the .cc would be better
than storing a condensed version of the changes we care about (the
preprocessor history). It's a strict subset, and if we really care, we can
shrink it further by being clever (collapse multiple changes between the
start of two top-level-decls, don't include changes in macros that weren't
expanded during the parse of the body of a top-level-decl). The sort of
output I'm expecting would be along the lines of "a.o was produced with
_DEBUG while b.o was produced with NDEBUG".