RFC: CodeView debug info emission in Clang/LLVM

David Majnemer and David Blaikie (this is seriously like the attack of the Daves) probably have some thoughts, but currently there is code in llvm-readobj.exe to parse certain types of codeview from object files (mostly line table information).

So one idea is to generate some object files that have CV records you know about up front, then pass the output of llvm-readobj (which would need to be updated to use LLVMCodeView instead of hand-rolled parsing) to FileCheck and verify that it matches some pattern.

Maybe the Daves have some other ideas as well.

Yep, this would test the dumping behavior - and it's how we test llvm-dwarfdump.
I assume you have similar checked-in-binary tests for llvm-pdbdump?

How you get output to dump is a bit fuzzy in this case (we don't have much
test coverage for this particular kind of situation) - one way is to create
another textual format (json, etc), read it, generate CV from it, dump it,
and FileCheck it, but that's a bit heavyweight.

I'd be inclined to write unit tests if possible - use the CV APIs directly
in a unit test, generate in-memory CV output to feed into the dumper
in-process, if possible (or, if necessary/substantially more convenient,
have the unit test actually write CV output, dump, check)

Hmm, yeah, not perfect, though - how do you check the dumped output?
FileCheck is our usual tool for this & there's, again, probably no great
in-process story for that...
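
To make the in-process question concrete, here is a minimal sketch of a unit-test-style check that dumps records to a string and verifies that expected fragments appear in order. Everything in it is hypothetical: FakeRecord, dumpRecords and checkInOrder are stand-ins invented for illustration, not LLVM APIs, and the ordered substring search is only a rough approximation of what FileCheck does.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a CodeView record; not an LLVM type.
struct FakeRecord {
  std::uint16_t Kind;
  std::string Name;
};

// Hypothetical dumper: writes one field per line into a string.
std::string dumpRecords(const std::vector<FakeRecord> &Records) {
  std::ostringstream OS;
  for (const FakeRecord &R : Records) {
    OS << "Record {\n";
    OS << "  Kind: 0x" << std::hex << R.Kind << std::dec << "\n";
    OS << "  Name: " << R.Name << "\n";
    OS << "}\n";
  }
  return OS.str();
}

// Returns true if every pattern occurs in Output, in order, with each
// match starting after the end of the previous one.
bool checkInOrder(const std::string &Output,
                  const std::vector<std::string> &Patterns) {
  std::size_t Pos = 0;
  for (const std::string &P : Patterns) {
    assert(!P.empty() && "an empty pattern would match trivially");
    std::size_t Found = Output.find(P, Pos);
    if (Found == std::string::npos)
      return false;
    Pos = Found + P.size();
  }
  return true;
}
```

A test would build a few records, call dumpRecords, and pass the result to checkInOrder with the expected fragments. This gives up FileCheck's pattern syntax and diagnostics, which is the trade-off raised above.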

Open to further ideas...

- Nth Dave

For reading DWARF we currently have two sort of redundant implementations in the overall LLVM project — one in lib/DebugInfo and another one in LLDB. Do you see an opportunity for sharing the PDB implementation between LLVM and LLDB?

-- adrian

Yes the PDB implementation will absolutely be shared.

I’m not responsible for the DWARF reading code in LLDB, but my understanding is that it is the way it is because they want to load a lot of debug info lazily, so their reader is optimized for that use case. Personally, I think that someone with the will and the knowledge could drive a change to LLDB to use LLVM’s DWARF reading code, making changes to LLVM’s implementation along the way to make sure that the performance characteristics remain the same. I would love it if someone did that.

I do not think that the “FileChecksums” mentioned in the CodeView patch (http://reviews.llvm.org/D14209) is the same thing as in DWARF 5.

In DWARF 5 the checksum is for the generated debug info sections, like .debug_info, .debug_macro, etc. Thus, it makes sense to do it in the codegen (debug emitter).

In CodeView, I believe the checksum is for the source file, so it makes more sense to calculate it in Clang.

Dave might be able to explain it for us.



For type units (a DWARF 4 feature) there is a “signature” computed with MD5, which is of course computed by LLVM as it creates the units.

In DWARF 5 there is a provision in the line table for providing an MD5 checksum of the source file (instead of the file size and modtime characteristics), which is exactly what you’ve described. Yes, this needs to be calculated in Clang and passed down to LLVM through the metadata.

Thanks for verifying!


Circling back around 4 months later...

I now believe that we should just let the frontend generate CV type info.
It's really not worth the hassle to try to have a common representation.
Enough C++ ABI-specific information leaks into the format that it's really
better to avoid trying to create a union of DWARF and CV type info in LLVM
DI metadata. We were able to reuse all the other non-type DI metadata, such
as location info and scope info, to emit inline line tables and variable
locations, so I think we did OK on reusing the existing infrastructure.
Compromising at not reusing the type representation seems OK.

I haven't come up with any ideas better than the design that Dave
Bartolomeo outlined below, so I think we should go ahead with that. One
thing I considered was extending DITypeRef to be a union between MDString*,
DIType*, and a type index, but I think that's too invasive. I also don't
want to make a whole DIType heap allocation just to wrap a 32-bit type
index, so I'm in favor of putting the indices into DISubprogram and

Any thoughts on this plan?
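
For readers following along, here is a rough sketch of what a three-way type reference (identifier string, type node, or type index) could look like. It is purely illustrative: FakeDIType, TypeRefSketch and describeRef are hypothetical names, std::string stands in for MDString, and nothing here reflects the actual DITypeRef implementation or the patch under review.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a type node; not the LLVM DIType class.
struct FakeDIType {
  std::string Name;
};

// Hypothetical tagged reference with three possible payloads.
struct TypeRefSketch {
  enum class Kind { Identifier, Node, Index } K;
  union {
    const std::string *Identifier; // stands in for MDString*
    const FakeDIType *Node;        // stands in for DIType*
    std::uint32_t Index;           // CodeView-style 32-bit type index
  };

  static TypeRefSketch identifier(const std::string *S) {
    TypeRefSketch R;
    R.K = Kind::Identifier;
    R.Identifier = S;
    return R;
  }
  static TypeRefSketch node(const FakeDIType *T) {
    TypeRefSketch R;
    R.K = Kind::Node;
    R.Node = T;
    return R;
  }
  static TypeRefSketch index(std::uint32_t I) {
    TypeRefSketch R;
    R.K = Kind::Index;
    R.Index = I;
    return R;
  }
};

// Every consumer has to handle all three cases, which is part of why
// this approach touches so much code.
std::string describeRef(const TypeRefSketch &R) {
  switch (R.K) {
  case TypeRefSketch::Kind::Identifier:
    assert(R.Identifier && "null identifier");
    return "identifier " + *R.Identifier;
  case TypeRefSketch::Kind::Node:
    assert(R.Node && "null node");
    return "node " + R.Node->Name;
  case TypeRefSketch::Kind::Index:
    return "index " + std::to_string(R.Index);
  }
  return "";
}
```

The index case carries no pointer at all, so it needs no separate allocation to wrap the 32-bit value.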

I think it’d be reasonable to at least figure out a good way to do type references consistently across the two schemes, but I’m OK with the idea of having a blob of opaque type information for different debug info formats, created by frontends. I don’t mind whether the library for building that blob lives in LLVM or Clang for now. The DWARF one at least would probably live in LLVM, because type info and other DWARF are described by similar/the same constructs (DIEs, abbrevs, etc). It seems like that’s not the case for PDB, so there might not be any code to share between LLVM’s CodeView needs and the type info construction. Then it’s just a matter of whether pushing that library down into LLVM for other frontends to use would be good, which it probably will be at some point, so if it goes into Clang I’d at least try to keep it pretty well separated.

Potentially that consistency could be created by going the other way - replace DITypeRef with an int, then have the retained types list be the int->type mapping. Skipping the mangled names. (& skip the retained types list for CV/PDB)
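
A minimal sketch of that int-based idea, assuming a flat table that plays the role of the retained types list. All names here (TypeIndex, TypeEntry, RetainedTypes, Variable, describeVariable) are hypothetical and chosen only for illustration; they are not LLVM classes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical: a type reference is just an integer.
using TypeIndex = std::uint32_t;
constexpr TypeIndex NoType = 0xFFFFFFFFu;

struct TypeEntry {
  std::string Name;   // e.g. "int" or "int*"
  TypeIndex Referent; // pointee/underlying type, or NoType
};

// Hypothetical table playing the role of the int -> type mapping.
class RetainedTypes {
  std::vector<TypeEntry> Entries;

public:
  // Appends a type and returns the index that now refers to it.
  TypeIndex add(std::string Name, TypeIndex Referent = NoType) {
    Entries.push_back({std::move(Name), Referent});
    return static_cast<TypeIndex>(Entries.size() - 1);
  }

  const TypeEntry &get(TypeIndex I) const {
    assert(I < Entries.size() && "dangling type index");
    return Entries[I];
  }
};

// A variable refers to its type by index only.
struct Variable {
  std::string Name;
  TypeIndex Type;
};

std::string describeVariable(const RetainedTypes &Types, const Variable &V) {
  return V.Name + ": " + Types.get(V.Type).Name;
}
```

With this shape, both debug info backends would resolve references the same way (index into a table), and only the contents of the table would differ per format.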

  • Dave


I said it before and I am saying it again, I do not think that this proposal is needed to support Codeview.

  1. Why can't CodeGen make use of the current DIType metadata to represent the CodeView types?

  2. Why can't “DW_TAG_typedef” be used to generate the “DICodeViewUDT” symbol?

  3. Why do we need the TypeIndex?

     - Why can't DISubprogram and DIVariable simply point to the DIType metadata, instead of having an index into an array where these DITypes are stored?

  4. Why are the “TypeRecords” of type MDString? Are they the source names of the types?

I believe that current Debug Info metadata contains all information needed to create the codeview information in codegen.

Thus, I do not see a need to either modify Clang or even modify the LLVM IR.

Please, if you have a concrete case where you think we have lost information needed for codeview between Clang and Codegen, tell us about it and I will be happy to help you figure out how to retrieve this information from current DI metadata.



In general, I agree here. I’m still unconvinced that this needs to happen this way.


DITypeRef wraps a Metadata*, though, not an int. Given that there are zero
users of DITypeRef in Transforms/ and Analysis/, I don't see why we should
try to forcibly create sharing where there is none. The only consumers of
type information are essentially the separate debug info backends.

I haven't looked in detail at the patch - but it sounded like the proposal
was to add an int field next to every DITypeRef field? That seems
verbose/intrusive to the schema compared to making the type reference
machinery able to be one or the other (or is the proposal to have DITypeRef
fields be a union of int or DITypeRef (then the DITypeRef itself is a union
of metadata reference or string)? If we already have a union of metadata or
string, it seems like the better thing to do would be to make it metadata,
string, or int rather than having two different layers for referring to

It is certainly *possible* to use the existing DIType hierarchy to generate
CodeView, but I don't believe it is useful. We would have to make the DI
metadata into the union of DWARF and CodeView, and it would be horrible.
Here is an incomplete list of things that would be awkward:

- Member pointer inheritance models. Not all pointers to members are the
same size.
- Describing locations of virtual bases in vbtables. I'm not sure how to
get from DW_TAG_inheritance data to "offset of vbptr from vfptr of complete
- Describing 'this' adjustments performed in virtual method prologues.
- New virtuality types to indicate "introducing" virtual methods.
- New flags on everything, see CodeView.h for more info.
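
To illustrate the first item, here is a small sketch that reports the size of a pointer to member function for classes using single, multiple, and virtual inheritance. The class names and the memberPointerSizes helper are hypothetical. The results depend on the C++ ABI: under the Microsoft ABI the sizes vary with the inheritance model, while under the Itanium ABI they are all the same, so no particular values are assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical classes, one per inheritance model.
struct Single {
  void f() {}
};
struct Other {
  void g() {}
};
struct Multiple : Single, Other {
  void h() {}
};
struct Virtual : virtual Single {
  void k() {}
};

// Returns (label, sizeof pointer-to-member-function) for each class.
// The numbers are whatever the target ABI dictates.
std::vector<std::pair<std::string, std::size_t>> memberPointerSizes() {
  return {
      {"single", sizeof(void (Single::*)())},
      {"multiple", sizeof(void (Multiple::*)())},
      {"virtual", sizeof(void (Virtual::*)())},
  };
}
```

Printing the returned pairs on an MSVC-compatible target and on a Linux target is a quick way to see why a debug info format tied to the Microsoft ABI has to record the inheritance model alongside the member pointer type.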

If you need more visibility into what's different, consider this C++ source:

struct A {
  virtual void f() {}
  int a;
};
struct B : virtual A {
  virtual void f() {}
  virtual void g() {}
  int b;
};
struct C : virtual A {
  virtual void f() {}
  virtual void h() {}
  int c;
};
struct D : B, C {
  virtual void f() {}
  virtual void g() {}
  virtual void h() {}
  int d;
};
D d;
auto mp = &D::f;

Compare the metadata that clang generates with the dump of the codeview
that MSVC generates, and decide for yourself if the representations are a
good match:
$ clang -cc1 -std=c++11 -emit-llvm -debug-info-kind=limited t.cpp -o - -triple x86_64-linux -o t.ll
LLVM IR: Spectre
$ cl -c t.cpp -Z7 && llvm-readobj -codeview t.obj
Dump of MSVC CodeView: Spectre

Sure, yes, it is *possible* to write a converter from one to the other, but
why is it necessary? What use case does it enable?

You might think it would allow non-Clang frontends to avoid having separate
type info emitters, but in practice it won't, because these frontends will
need to be augmented to pass down all kinds of CV-specific junk.

Hi All,

Reid, Dave and I have chatted about this quite a bit and I think we have a way forward that gets us in a direction we’d like to go, offers some potential performance benefits for existing DWARF users, and maintains some compatibility while transitions are happening. We’re currently writing up a proposal and will send it out for RFC shortly.