Handling of the ELF files missing build-ids?

Greg, Mark,

Looking at the code, LLDB falls back to a full file crc32 to create the module UUID if the ELF build-id is missing. This works, in the sense that the generated UUID does indeed identify the module.

But there are a few problems with this approach:

  1. First, runtime performance: a full file crc32 is a terribly inefficient way to generate a temporary UUID that is basically just used to match a local file to itself.
  • especially when some unstripped binaries can be very large. for example a local chromium build produces a 5.3Gb chrome binary
  • the crc32 implementation is decent, but single-threaded
  • to add insult to the injury, it seems a small bug defeats the intention to cache the hash value so it ends up being recalculated multiple times
  1. The fake UUID is not going to match any external UUID that may be floating around (and yet not properly embedded into the binary)
  • an example is Breakpad, which unfortunately also attempts to make up UUIDs when the build-id is missing (something we’ll hopefully fix soon)

Is there a fundamental reason to calculate the full file crc32? If not I propose to improve this based on the following observations:

A. Model the reality more accurately: an ELF w/o a build-id doesn’t really have an UUID. So use a zero-length UUID in LLDB.
B. The full file name should be enough to prove the identity of a local module.
C. If we try to match an external UUID (ex. from a minidump) with a local file which does not have an UUID it may help to have an option to allow it to match (off by default, and only if there’s no better match)

What do you think?

Thanks,
Lemo.

Greg, Mark,

Looking at the code, LLDB falls back to a full file crc32 to create the module UUID if the ELF build-id is missing. This works, in the sense that the generated UUID does indeed identify the module.

But there are a few problems with this approach:

1. First, runtime performance: a full file crc32 is a terribly inefficient way to generate a temporary UUID that is basically just used to match a local file to itself.
- especially when some unstripped binaries can be very large. for example a local chromium build produces a 5.3Gb chrome binary
- the crc32 implementation is decent, but single-threaded
- to add insult to the injury, it seems a small bug defeats the intention to cache the hash value so it ends up being recalculated multiple times

2. The fake UUID is not going to match any external UUID that may be floating around (and yet not properly embedded into the binary)
- an example is Breakpad, which unfortunately also attempts to make up UUIDs when the build-id is missing (something we'll hopefully fix soon)

Is there a fundamental reason to calculate the full file crc32? If not I propose to improve this based on the following observations:

A. Model the reality more accurately: an ELF w/o a build-id doesn't really have an UUID. So use a zero-length UUID in LLDB.
B. The full file name should be enough to prove the identity of a local module.
C. If we try to match an external UUID (ex. from a minidump) with a local file which does not have an UUID it may help to have an option to allow it to match (off by default, and only if there's no better match)

What do you think?

I am fine with all the above except some reservations about case C. No need to calculate something if it isn't useful. For case C it should be fine to never match as if a file has a UUID to begin with it typically isn't something that gets stripped in a stripped binary. So we should either have it or not. If breakpad does calculate a CRC32, then we need to know to ignore the UUID. The problem is we probably won't be able to tell what the UUID is: real from build ID, or from GNU debug info CRC, or CRC of entire file. So the minidump code will need to do something here. If a minidump has the linux auxv and memory map in them, then we might need to dig through the section information and deduce if a file matches or not based off the size of mapped program headers to further help with the matching.

One other idea is to make a set of enumerations for the UUID type:

class UUID {
  enum class Type {
    BuildID, // A build ID from the compiler or linker
    GNUDebugInfoCRC, // GNU debug info CRC
    MD5, // MD5 of entire file
    MD5NonDebug, // MD5 of the non debug info related bits
    CRC32, // CRC32 of entire file
    Other, // Anything else
  };
};

The eTypeMD5NonDebug is what apple does: it MD5 checksums only the parts of the file that don't change with debug info or any paths found in debug info or symbols tables. So if you build a binary in /tmp/a or in /private/var/local/foo, the UUID is the same if the binary is essentially the same (code, data, etc).

Then we can make intelligent comparisons between UUID types. Might even be possible for a module to have more than 1 UUID then if a binary contains a eTypeBuildID and a eTypeGNUDebugInfoCRC. If a tool stores its UUIDs as a CRC32 or MD5, then those can be calculated on the fly. The GetUUID on lldb_private::Module might become:

const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type);

Thoughts?

Greg

Hello Leonard,

while I'm in principle supportive of this idea, I think it's not going
to be as easy as you might imagine. There are currently at least two
mechanisms which rely on this crc UUID.

1. .gnu_debuglink separate file pointer
<https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html&gt;\.
This is where the choice of the crc algorithm comes from.

In short, this mechanism for debug info location works like this: The
stripped file contains a .gnu_debuglink section. The section contains
a file path and a crc checksum. After reading this section the
debugger is expected to look for the file at the given path, and then
compute it's checksum to verify it is indeed the correct file (hasn't
been modified).

In LLDB, this is implemented somewhat differently. First we have a
mechanism for assigning UUIDs to modules. This mechanism does one of
two things (I'm assuming here none of the files have proper build-ids
in them). If the file has a .gnu_debuglink section (the stripped
file), we fetch the CRC from there, and assign that as the build-id of
the file. If the file doesn't have this section, we **assume** it is
going to be a target of gnu_debuglink pointer in another file, compute
its CRC via the specified algorithm, and assign that as the build-id.

Second, we have a mechanism for locating the unstripped file. This
just fetches the path from gnu_debuglink, and compares the UUIDs of
the two modules. This works because of the UUID algorithm in the first
part.

Now, this is not a particularly smart way of doing things and it is
the cause of the most of your problems. However, it is completely
avoidable. Instead of piggy-backing on the UUID mechanism, we should
just change the search mechanism (the second part) to compute the crc
itself, and only after it successfully finds the file referenced by
the gnu_debuglink section. This way, we will only compute the crc only
when absolutely necessary (i.e. for your use case, never).

2. Currently, for remote debugging, we assume that each module has
some sort of a unique identifier which we can check to see whether we
have downloaded that file already (see qModuleInfo packet). I am not
sure what would happen if we suddenly just stopped computing a UUID
for these files, but we at the very least lose the ability to detect
changes to the remote files. For this item, I am not sure what would
be the best course of action. Maybe we should just start relying on
modification timestamp for these files?

Greg, Mark,

Looking at the code, LLDB falls back to a full file crc32 to create the module UUID if the ELF build-id is missing. This works, in the sense that the generated UUID does indeed identify the module.

But there are a few problems with this approach:

1. First, runtime performance: a full file crc32 is a terribly inefficient way to generate a temporary UUID that is basically just used to match a local file to itself.
- especially when some unstripped binaries can be very large. for example a local chromium build produces a 5.3Gb chrome binary
- the crc32 implementation is decent, but single-threaded
- to add insult to the injury, it seems a small bug defeats the intention to cache the hash value so it ends up being recalculated multiple times

2. The fake UUID is not going to match any external UUID that may be floating around (and yet not properly embedded into the binary)
- an example is Breakpad, which unfortunately also attempts to make up UUIDs when the build-id is missing (something we'll hopefully fix soon)

Is there a fundamental reason to calculate the full file crc32? If not I propose to improve this based on the following observations:

A. Model the reality more accurately: an ELF w/o a build-id doesn't really have an UUID. So use a zero-length UUID in LLDB.
B. The full file name should be enough to prove the identity of a local module.
C. If we try to match an external UUID (ex. from a minidump) with a local file which does not have an UUID it may help to have an option to allow it to match (off by default, and only if there's no better match)

I think we might have something already which could serve this
purpose. Eugene added a couple of months ago a mechanism to force-load
symbols for a given file regardless of the UUIDs
<Login. It requires an explicit "target
symbols add" command (which seems reasonable, as lldb has no way to
tell if it's doing things right). Would something like that work for
you?

cheers,
pl

I am fine with all the above except some reservations about case C. No need to calculate something if it isn’t useful. For case C it should be fine to never match as if a file has a UUID to begin with it typically isn’t something that gets stripped in a stripped binary. So we should either have it or not. If breakpad does calculate a CRC32, then we need to know to ignore the UUID. The problem is we probably won’t be able to tell what the UUID is: real from build ID, or from GNU debug info CRC, or CRC of entire file. So the minidump code will need to do something here. If a minidump has the linux auxv and memory map in them, then we might need to dig through the section information and deduce if a file matches or not based off the size of mapped program headers to further help with the matching.

One other idea is to make a set of enumerations for the UUID type:

class UUID {
enum class Type {
BuildID, // A build ID from the compiler or linker
GNUDebugInfoCRC, // GNU debug info CRC
MD5, // MD5 of entire file
MD5NonDebug, // MD5 of the non debug info related bits
CRC32, // CRC32 of entire file
Other, // Anything else
};
};

The eTypeMD5NonDebug is what apple does: it MD5 checksums only the parts of the file that don’t change with debug info or any paths found in debug info or symbols tables. So if you build a binary in /tmp/a or in /private/var/local/foo, the UUID is the same if the binary is essentially the same (code, data, etc).

Then we can make intelligent comparisons between UUID types. Might even be possible for a module to have more than 1 UUID then if a binary contains a eTypeBuildID and a eTypeGNUDebugInfoCRC. If a tool stores its UUIDs as a CRC32 or MD5, then those can be calculated on the fly. The GetUUID on lldb_private::Module might become:

const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type);

Thoughts?

Greg

I like the idea of UUID sub-types! This solves the problem is a more generic fashion and it’s also extensible. Interestingly enough, for Crashpad we’re considering something similar (the fabricated UUIDs would have a different CvRecord signature)

Case C. is a bit ugly so let me elaborate: this is specific to Breakpad minidump + ELF binaries w/o build-id. So we’d still need to have a way to force the match of the modules in the minidump with the local files. This ought to be an off-by-default, sharp tool which you’d only need to deal with Breakpad minidumps, and even then it would only be a fall-back that must be explicitly requested.

  1. .gnu_debuglink separate file pointer
    <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>.
    This is where the choice of the crc algorithm comes from.

Thanks Pavel. As you noted yourself, this doesn’t mean that the UUID has to be tied to the checksum (they are exclusive options). In particular, for performance reasons I think it’s desirable to avoid calculating checksums for every ELF module just in case.

I think we might have something already which could serve this
purpose. Eugene added a couple of months ago a mechanism to force-load
symbols for a given file regardless of the UUIDs
<https://reviews.llvm.org/D35607>. It requires an explicit “target
symbols add” command (which seems reasonable, as lldb has no way to
tell if it’s doing things right). Would something like that work for
you?

Nice. We may have to update the C++ API, but something like this would do the trick for case C.

To summarize the conversation so far:

  1. We can fix cases A, B independent of C: if real UUIDs are missing, don’t automatically use full file CRC32 as UUID.
  2. Pay attention to make sure that we don’t break .gnu_debuglink or remote debugging (thanks Pavel)
  3. Multiple types/namespaces for UUIDs would be a great feature!
  4. Eugene’s symbol forcing trick could be extended to handle case C

Did I miss anything?

My current plan is to start with #1, then look into #4 (force symbols match).

Sounds like a plan.