LLDB does not support the default 8 byte build ID generated by LLD

Hey guys,

LLDB uses source/Utility/UUID.cpp to store the build ID. This class only supports 16 or 20 byte IDs.

When parsing the .note.gnu.build-id ELF section, any build ID between 4 and 20 bytes will be parsed and saved (which will silently fail if the size isn’t 16 or 20 bytes) https://github.com/llvm-mirror/lldb/blob/4dc18b8ce3f95c2aa33edc4c821909c329e94be9/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp#L1279 .

I discovered this issue because by default LLD will generate a 8 byte build ID, causing LLDB to ignore the .note.gnu.build-id ELF section and compute a crc32 at runtime.

Is this a know issue that somebody is already working on? (After a quick search I couldn’t find any open bugs with a similar description).

Does anybody have any objection to modifying UUID::SetBytes to accept any byte array with a size between 4 - 20 bytes, and pad with zeros to the next largest supported size (either 16 or 20 bytes).

ex.
Setting a UUID with length of 8, would pad with 8 trailing zeros to have an overall length of 16.
Setting a UUID with length of 17, would pad with 3 trailing zeros to have an overall length of 20.

Thanks,
Scott

Thanks for the heads up Scott. I was not aware that lld generates
build-ids with different length.

Padding would be one option (we already do that to handle the crc
pseudo-build-ids), but perhaps a better one would be to teach the
class how to handle arbitrary-sized UUIDs (or up to 20 bytes, at
least).

I don't think there's a fundamental reason reason why only these two
lengths are acceptable. The class originally supported 16 bytes only,
because that's how mac UUIDs look like. Then, later, when we were
bringing up linux, we added 20-byte support because that's what the
gnu linkers generated. But, as it seems that this field can be of any
size, maybe it's time to teach UUID how to handle the new reality.

Have you looked at how hard would it be to implement something like that?

pl

I took a quick look, it should be fairly straight forward. The one wrinkle (which is just a design decision) is how to represent the variable length UUID as a human readable string (Ex. 16 byte UUIDs are represented as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).

I guess the one thing that is giving me pause, is that according to the spec UUIDs are only supposed to be 16 bytes. UUID.cpp already isn’t strictly to spec because 20 byte IDs are already supported, but it seems like supporting arbitrary length IDs is an even further departure from the spec. Maybe this is just semantics and doesn’t really matter. I don’t have a strong opinion one way or another, you definitely have more context than me, and if you think using arbitrary length IDs makes more sense than padding we can move forward with that solution.

I had made a local attempt at making UUID support arbitrary sizes (part of extracting the UUIDs from minidumps). I ended up abandoning the UUID changes since there were not strictly in scope and I also had the same uneasy feeling about how flexible do we really want to be with UUIDs.

Overall, the change was aesthetically pleasing since the UUID interface can be cleaned up a bit, but there are a few small downsides I remember:

  1. A variable-length UUID likely incurs an extra heap allocation.
  2. Formatting arbitrary length UUIDs as string is a bit inconvenient as you noted as well.

I may have an old patch with these changes, let me dig a bit.

I had that issue a while back and uploaded a few diffs that fix the problem, but never landed them. I don’t remember exactly what needed to change, but you can view them here:
https://reviews.llvm.org/D40538
https://reviews.llvm.org/D40537

Let me know if you want to help get them committed to the tree.

From the looks of it, the patch stalled on the part whether we can

consider all-zero UUIDs as valid or not. I've dug around the code a
bit now, and I've found this comment in ObjectFileMachO.cpp.

           // "main bin spec" (main binary specification) data payload is
           // formatted:
           // uint32_t version [currently 1]
           // uint32_t type [0 == unspecified, 1 ==
kernel, 2 == user process]
           // uint64_t address [ UINT64_MAX if address not specified ]
           // uuid_t uuid [ all zero's if uuid not specified ]
           // uint32_t log2_pagesize [ process page size in log
base 2, e.g. 4k pages are 12. 0 for unspecified ]

So it looks like there are situations where we consider all-zero UUIDs
as invalid.

I guess that means we either have to keep IsValid() definition as-is,
or make ObjectFileMachO check the all-zero case itself. (Some middle
ground may be where we have two SetFromStringRef functions, one which
treats all-zero case specially (sets m_num_uuid_bytes to 0), and one
which doesn't). Then clients can pick which semantics they want.

1. A variable-length UUID likely incurs an extra heap allocation.

Not really. If you're happy with the current <=20 limit, then you can
just use the existing data structure. Otherwise, you could use a
SmallVector<uint8_t, 20>.

2. Formatting arbitrary length UUIDs as string is a bit inconvenient as you noted as well.

For the string representation, I would say we should just use the
existing layout of dashes (after 4, 6, 8, 10 and 16 bytes) and just
cut it short when we have less bytes. The implementation of that
should be about a dozen lines of code.

The fact that these new UUIDs would not be real UUIDs could be solved
by renaming this class to something else, if anyone can think of a
good name for it (I can't). Then the "real" UUIDs will be just a
special case of the new object.

pl

Here’s a snapshot of the old changes I had: https://reviews.llvm.org/D48381
(hopefully it helps a bit but caveat emptor: this is a quick merge from an old patch, so it’s for illustrative purposes only)

I am fine if we go with any number of bytes. We should have the lldb_private::UUID class have an array of bytes that is in the class that is to to 20 bytes. We can increase it later if needed. I would rather not have a dynamically allocated buffer.

That being said a few points:

  • Length can be set to zero to indicate invalid UUID. Better that than filling in all zeroes and having to check for that IMHO. I know there were some problems with the last patch around this.
  • Don’t set length to a valid value and have UUID contain zeros unless that is a true UUID that was calculated. LLDB does a lot of things by matching UUID values so we can’t have multiple modules claiming to have a UUID that is filled with zeroes, otherwise many matches will occur that we don’t want
  • 32 bit GNU debug info CRCs from ELF notes could be filled in as 4 byte UUIDs
  • Comparing two UUIDs can start with the length field first the if they match proceed to compare the bytes (which is hopefully what is already happening)

That sounds like a plan. I have started cleaning up the class a bit
(removing manual uuid string formatting in various places and such),
and then I'll send a patch which implements that.

Leonard, I'm not going to use your patch, as it's a bit un-llvm-y
(uses std::ostream and such). However, I wanted to check whether 20
bytes will be enough for your use cases (uuids in minidumps)?

Leonard, I’m not going to use your patch, as it’s a bit un-llvm-y
(uses std::ostream and such). However, I wanted to check whether 20
bytes will be enough for your use cases (uuids in minidumps)?

For minidumps we normally use either 16 or 20 byte UUIDs, so I don’t see any immediate problems. Are you planning to make 20 a hard limit or have the 20 bytes “inlined” and dynamically allocate if larger?

Leonard, I’m not going to use your patch, as it’s a bit un-llvm-y
(uses std::ostream and such). However, I wanted to check whether 20
bytes will be enough for your use cases (uuids in minidumps)?

For minidumps we normally use either 16 or 20 byte UUIDs, so I don’t see any immediate problems. Are you planning to make 20 a hard limit or have the 20 bytes “inlined” and dynamically allocate if larger?

We could use a llvm::SmallVector<uint8_t, 20> to have up to 20 bytes before going larger and allocating on the heap.

I didn’t know about llvm::SmallVector, but it seems a good match indeed, thanks Greg.