Debug Info Generation in Clang.

Hi,

I would like to use the Clang front-end for our toolchain.

One feature that we would like to see in Clang is source-level debugging information generation.

What is the latest state of clang debug info generation? Is anything being done about it?

If yes, how can I contribute?

If not, I would like to take that up. What would be a good starting point?

Thanks,

Sanjiv

> I would like to use the Clang front-end for our toolchain.
> One feature that we would like to see in Clang is source-level debugging information generation.

Great!

> What is the latest state of clang debug info generation? Is anything being done about it?

Right now, no one has started working on it. This should be a relatively easy project to hook up though.

> If not, I would like to take that up. What would be a good starting point?

As Shakti pointed out, this is the canonical documentation on the debug info:
http://llvm.org/docs/SourceLevelDebugging.html

In practice though, it would be useful to write some simple testcases and send them through llvm-gcc to see what code is generated. You can also see similar code in llvm-gcc for turning GCC ASTs into LLVM debug info (e.g. for types), whose logic should translate over pretty directly.

To get started, I’d suggest beginning with simple line number information. When you can emit the llvm.dbg.stoppoint intrinsics (and the metadata that describes translation units etc) you should be able to step through a program with GDB. After that, I’d tackle variable/function descriptors, type descriptors, etc.

Please let me know if you run into any tricky parts. This would be a great contribution to clang!

-Chris

Oh, great! I'm looking forward to programming my robot using clang :P (it has a PIC16F876 processor)
Nuno

Hi,

I did a (very) little work on this in November, but didn’t have time to get a patch out. Some bits of information you might find interesting:

A mail from cfe-dev (response from Chris):

> I started coding and have a few functionalities working but I hesitate
> between two possible implementations.
>
> What I started doing is:
>
>   • Add members and accessor functions to CodeGenModule and CodeGenFunction.
>   • Insert code directly in the code-generating functions of these classes,
>     conditionally activated on a flag:

I’d suggest adding a new class, e.g. CodeGenDebugInfo, that holds the debug-related information. This is the model that llvm-gcc uses in its llvm-convert.cpp and llvm-debug.cpp files. The nice thing about this is that you can then do stuff like:

    void CodeGenFunction::GenerateCode(const FunctionDecl *FD) {
      […]
      // Create subprogram descriptor.
      if (DebugInfo)
        DebugInfo->CreateSubProgramDesc(…);

The nice thing about this model is that it keeps the debug info emission code localized in one file, while making the hooks into it obvious.
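A minimal, self-contained sketch of that model follows. The two classes are hypothetical stand-ins that only borrow the names used above; they are not the real clang classes, and a function is reduced to its name. The point is just that the generator carries a single nullable pointer and that the pointer check is the only debug-related code in it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed CodeGenDebugInfo class: it holds
// everything debug-related, so the code generator needs only one pointer.
class CodeGenDebugInfo {
public:
  void CreateSubProgramDesc(const std::string &FunctionName) {
    Subprograms.push_back(FunctionName);
  }
  const std::vector<std::string> &subprograms() const { return Subprograms; }

private:
  std::vector<std::string> Subprograms;
};

// Hypothetical, heavily simplified code generator.
class CodeGenFunction {
public:
  // DI stays null unless debug info was requested.
  explicit CodeGenFunction(CodeGenDebugInfo *DI) : DebugInfo(DI) {}

  void GenerateCode(const std::string &FunctionName) {
    // The only debug-related lines in the generator: an obvious, cheap hook.
    if (DebugInfo)
      DebugInfo->CreateSubProgramDesc(FunctionName);
    ++FunctionsEmitted; // stands in for the real code generation work
  }

  int functionsEmitted() const { return FunctionsEmitted; }

private:
  CodeGenDebugInfo *DebugInfo;
  int FunctionsEmitted = 0;
};
```

With a null pointer the generator behaves exactly as before, which is what keeps the non-debug path untouched.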

Another mail that I wrote to llvm (without an answer):

From the implementation of DISerializer, it seems to me I need to keep all my debug information objects alive until the end of the compilation unit (until the destruction/last use of the instance of DISerializer). For example, I would need to keep all the SubprogramDesc objects for all the functions from the translation unit. Have I understood correctly?

(The problem is that the map of the serialized data is keyed by the address of the descriptor. If the descriptor is deleted, a new (and different) descriptor could take the same address and cause a hard-to-find bug. As far as I can see, the data isn’t accessed after being serialized, so if not for the previous point, it would be safe to delete the descriptor after use/serialization.)

This makes the lifetime management of the objects harder. I had a patch for changing this behavior, allowing us to delete the descriptor once serialized, but nobody seemed interested.
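To make the constraint concrete, here is a small sketch of the "keep everything alive" approach. All names are hypothetical (AddressKeyedSerializer and DescriptorPool are not LLVM classes), and a descriptor is reduced to a name. The serializer remembers what it has seen by address, so the pool owns every descriptor for as long as the serializer is in use.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical descriptor; the real ones carry far more fields.
struct SubprogramDesc {
  std::string Name;
};

// Hypothetical serializer that, like the one described above, remembers
// what it has already serialized by the descriptor's address.
class AddressKeyedSerializer {
public:
  // Returns the ID of the serialized form, serializing on first use.
  std::size_t serialize(const SubprogramDesc *Desc) {
    auto It = Ids.find(Desc);
    if (It != Ids.end())
      return It->second; // same address: assumed to be the same descriptor
    std::size_t Id = Serialized.size();
    Serialized.push_back(Desc->Name);
    Ids[Desc] = Id;
    return Id;
  }
  const std::vector<std::string> &serialized() const { return Serialized; }

private:
  std::map<const SubprogramDesc *, std::size_t> Ids;
  std::vector<std::string> Serialized;
};

// Owns every descriptor until the pool itself is destroyed, so no address
// can be recycled for a different descriptor while the serializer is in use.
class DescriptorPool {
public:
  SubprogramDesc *create(const std::string &Name) {
    Pool.push_back(std::unique_ptr<SubprogramDesc>(new SubprogramDesc{Name}));
    return Pool.back().get();
  }

private:
  std::vector<std::unique_ptr<SubprogramDesc>> Pool;
};
```

If a descriptor were deleted early and a new one happened to land at the same address, `serialize` would return the old ID for the new descriptor, which is the hard-to-find bug described above.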

And attached is a first try at adding function definition debug information. It is out of date and never worked, but you may find a few interesting things (I mostly copied a part of llvm-gcc) [note: llvm::scoped_ptr has been renamed llvm::OwnedPtr or something like this since then].

Regards,

Cédric

CodeGenDebugInfo.cpp (3.52 KB)

CodeGenDebugInfo.h (1.52 KB)

When it is ready, you'll probably be among the first to know among PIC users.

Cheers,
A.

Thanks Chris and Cédric for your inputs.
I will keep discussing on the list any issues I face while working on this.

- Sanjiv

llvm-gcc uses the TreeToLLVM class to generate code for a function.
The Emit(tree exp, const MemRef *DestLoc) method of that class handles
code generation for expressions and also emits the stoppoint intrinsics
as the first thing.

I see that the corresponding place in clang is
CodeGenFunction::EmitScalarExpr (const Expr *E).
We can call getExprLoc() on the Expr and use it to set the
SourceLocation (File, Line) of the CodeGenDebugInfo instance every time
we emit an intrinsic.

Am I looking at the right places? Opinions welcome.

- Sanjiv

I may be wrong, but shouldn't you rather put it in
CodeGenFunction::EmitStmt? ScalarExpr is only an Expr of scalar type, which
is one kind of Stmt. You would miss a lot of interesting stoppoints.
Also, for expressions, you probably don't want to put a stoppoint for each
operation. I don't know what the best granularity for this would be...

Regards,

--
Cédric

I was thinking that putting it in CodeGenFunction::EmitStmt might result in redundant stoppoints being emitted. But that is taken care of by the EmitStopPoint function itself, which checks whether the line number has changed from the previous one. So EmitStmt looks like the correct place.
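The check described here can be sketched in a few lines. This is a hypothetical emitter, not the llvm-gcc or clang code: it records line numbers instead of emitting intrinsics, and it ignores the file, which a real version would also have to compare.

```cpp
#include <cassert>
#include <vector>

// Hypothetical emitter: records a stoppoint only when the line changes.
class StopPointEmitter {
public:
  void EmitStopPoint(unsigned Line) {
    if (HasPrevious && Line == PreviousLine)
      return; // still on the same source line: nothing new to emit
    Emitted.push_back(Line);
    PreviousLine = Line;
    HasPrevious = true;
  }
  const std::vector<unsigned> &emitted() const { return Emitted; }

private:
  bool HasPrevious = false;
  unsigned PreviousLine = 0;
  std::vector<unsigned> Emitted;
};
```

Calling `EmitStopPoint` for lines 3, 3, 3, 4 records only 3 and 4, so callers can invoke it freely without tracking the previous location themselves.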

Other things I wanted to know:

1. Where should an instance of CodeGenDebugInfo be kept? The choices are:
  (A) In CodeGenerator inside ModuleBuilder.cpp. But for that we will need to provide CodeGenDebugInfo.h inside include/clang/CodeGen.
  (B) As a member of CodeGenModule, constructed when the CodeGenModule is constructed.

  I am currently using (B).

Of course we will need to use a command-line flag to construct the CodeGenDebugInfo.

2. llvm-gcc uses FullPath to create a CompileUnitCache. I want to use SourceLocation (an unsigned int) for that purpose. Is that unique for full paths?

3. SourceManager does not provide APIs like GetDirName (), GetFullPath (). GetDirName() should be easy to provide as we can get the DirectoryEntry from Loc and then its name. Should I add this one?
GetFullPath () would be a little tricky to implement due to path name variations on different platforms.
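On question 2, a sketch of the kind of cache involved, assuming the goal is one compile unit descriptor per source file. CompileUnitCache here is a hypothetical class that hands out plain IDs; it keys on the path string because a source location identifies a position, and two positions in the same file would presumably not compare equal.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache: one compile-unit ID per distinct file path.
class CompileUnitCache {
public:
  // Returns the ID for Path, creating a new one the first time it is seen.
  std::size_t getOrCreate(const std::string &Path) {
    // insert() leaves an existing entry untouched and returns it instead.
    auto Inserted = Ids.insert(std::make_pair(Path, Ids.size()));
    return Inserted.first->second;
  }
  std::size_t size() const { return Ids.size(); }

private:
  std::map<std::string, std::size_t> Ids;
};
```

The sketch treats two spellings of the same file (for example a relative and an absolute path) as different keys, which is one reason a full path is attractive as the key.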

-Sanjiv

> I was thinking that putting it in CodeGenFunction::EmitStmt might result in redundant stoppoints being emitted. But that is taken care of by the EmitStopPoint function itself, which checks whether the line number has changed from the previous one. So EmitStmt looks like the correct place.

Yep, that makes sense to me too.

> Other things I wanted to know:
>
> 1. Where should an instance of CodeGenDebugInfo be kept? The choices are:
>   (A) In CodeGenerator inside ModuleBuilder.cpp. But for that we will need to provide CodeGenDebugInfo.h inside include/clang/CodeGen.
>   (B) As a member of CodeGenModule, constructed when the CodeGenModule is constructed.
>
>   I am currently using (B).

(B) seems right.

> Of course we will need to use a command-line flag to construct the CodeGenDebugInfo.

-g! :)

> 2. llvm-gcc uses FullPath to create a CompileUnitCache. I want to use SourceLocation (an unsigned int) for that purpose. Is that unique for full paths?

I'm not sure what you mean. What do you need here exactly? A unique ID for each source file / header in a translation unit?

> 3. SourceManager does not provide APIs like GetDirName (), GetFullPath (). GetDirName() should be easy to provide as we can get the DirectoryEntry from Loc and then its name. Should I add this one?
> GetFullPath () would be a little tricky to implement due to path name variations on different platforms.

The filename for a file should be returned by 'SourceMgr.getSourceName(Loc);'. Does this work for you?

-Chris

%llvm.dbg.compile_unit.type = type {
    uint, ;; Tag = 17 + LLVMDebugVersion (DW_TAG_compile_unit)
    { }*, ;; Compile unit anchor = cast = (%llvm.dbg.anchor.type* %llvm.dbg.compile_units to { }*)
    uint, ;; Dwarf language identifier (ex. DW_LANG_C89)
    sbyte*, ;; Source file name
    sbyte*, ;; Source file directory (includes trailing slash)
    sbyte* ;; Producer (ex. "4.0.1 LLVM (LLVM research group)")
  }

I do not know how to retrieve the last two pieces of information here.
DirectoryEntry.getName() gives a relative path, not an absolute one.
What is the API to retrieve the version string?

- Sanjiv

> I do not know how to retrieve the last two pieces of information here.
> DirectoryEntry.getName() gives a relative path, not an absolute one.

Ok, use SourceMgr.getFileEntryForLoc(Loc), which returns a FileEntry. A FileEntry has a 'getName()' accessor for the file name, and 'getDir()' which returns a directory entry. DirectoryEntry has 'getName()' to get the name of the dir.

> What is the API to retrieve the version string?

I would just set it to 'clang' or something.
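Putting the pieces of this exchange together, the descriptor fields might be filled in roughly as below. This is a plain-C++ sketch, not clang code: CompileUnitFields and makeCompileUnit are hypothetical names, the anchor field is omitted, the path is split by hand instead of going through FileEntry and DirectoryEntry, and the debug version is a parameter because its value is not given in the thread. The "./" fallback for a bare file name is also an assumption.

```cpp
#include <cassert>
#include <string>

// Plain-C++ mirror of the descriptor fields quoted above (anchor omitted).
struct CompileUnitFields {
  unsigned Tag;
  unsigned Language;
  std::string FileName;
  std::string Directory; // includes trailing slash
  std::string Producer;
};

const unsigned DW_TAG_compile_unit = 17; // 0x11 in DWARF
const unsigned DW_LANG_C89 = 1;          // 0x0001 in DWARF

// DebugVersion is passed in because its value is not given in the thread.
CompileUnitFields makeCompileUnit(const std::string &Path,
                                  unsigned DebugVersion) {
  CompileUnitFields CU;
  CU.Tag = DW_TAG_compile_unit + DebugVersion;
  CU.Language = DW_LANG_C89;
  std::string::size_type Slash = Path.rfind('/');
  if (Slash == std::string::npos) {
    CU.Directory = "./"; // assumption: a bare file name is in the current dir
    CU.FileName = Path;
  } else {
    CU.Directory = Path.substr(0, Slash + 1); // keep the trailing slash
    CU.FileName = Path.substr(Slash + 1);
  }
  CU.Producer = "clang"; // as suggested above, in place of a version string
  return CU;
}
```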

-Chris

> I do not know how to retrieve last two pieces of information here.
> DirectoryEntry.getName() gives relative path but not absolute.

> Ok, use SourceMgr.getFileEntryForLoc(Loc), which returns a
> FileEntry.
> A FileEntry has a 'getName()' accessor for the file name, and
> 'getDir()' which returns a directory entry. DirectoryEntry
> has 'getName()' to get the name of the dir.

Chris,
I meant the same thing when I said "DirectoryEntry.getName()."
But the problem is that it doesn't give an absolute path.
llvm-gcc keeps the absolute path in compile_unit.

> What is the API to retrieve the version string?
>
> I would just set it to 'clang' or something.

Fine.

- Sanjiv

I don't think you need an absolute path. If you really want it, you can prefix 'pwd' onto it.
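If an absolute directory is wanted after all, the prefixing could look like the sketch below. makeAbsoluteDir is a hypothetical helper; it assumes POSIX-style paths, takes the working directory as an argument (how it is obtained is left to the caller), and does not resolve "." or ".." components.

```cpp
#include <cassert>
#include <string>

// Prefixes WorkingDir onto Dir unless Dir is already absolute (POSIX-style
// paths only), and makes sure the result ends with a slash.
std::string makeAbsoluteDir(const std::string &WorkingDir,
                            const std::string &Dir) {
  assert(!WorkingDir.empty() && WorkingDir[0] == '/' &&
         "working directory must be absolute");
  std::string Result;
  if (!Dir.empty() && Dir[0] == '/') {
    Result = Dir; // already absolute: leave it alone
  } else {
    Result = WorkingDir;
    if (Result[Result.size() - 1] != '/')
      Result += '/';
    Result += Dir;
  }
  if (Result[Result.size() - 1] != '/')
    Result += '/'; // the descriptor expects a trailing slash
  return Result;
}
```

For example, `makeAbsoluteDir("/home/user", "src")` gives `/home/user/src/`.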

-Chris

Well, putting it into EmitStmt still results in unnecessary stoppoints
being generated.

For a piece of code like
1: foo ()
2: {
3: int I = 5;
4: }

Two stoppoints will be generated, for line 2 and line 3.

We do not want to generate a stoppoint for a '{' (CompoundStmt).

I am thinking of going back to my earlier idea and putting it into the
functions below:
EmitScalarExpr
EmitComplexExpr
EmitAggExpr

- Sanjiv

> Well, putting it into EmitStmt still results in unnecessary stoppoints
> being generated.
>
> For a piece of code like
> 1: foo ()
> 2: {
> 3: int I = 5;
> 4: }
>
> Two stoppoints will be generated, for line 2 and line 3.
>
> We do not want to generate a stoppoint for a '{' (CompoundStmt).

I think it is probably best to put it into EmitStmt and filter out stmts that you don't want stoppoints for. EmitStopPoint itself should avoid emitting multiple stoppoints on the same line. For example, in "x = 4; y = 1; z = 12;" we only want one stoppoint.
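A sketch of that arrangement, with hypothetical types standing in for the clang AST (StmtKind, Stmt and StmtEmitter are made up for illustration, and which statement kinds to skip is an assumption beyond CompoundStmt):

```cpp
#include <cassert>
#include <vector>

// Hypothetical statement kinds and statement node.
enum StmtKind { CompoundStmtKind, DeclStmtKind, ExprStmtKind };

struct Stmt {
  StmtKind Kind;
  unsigned Line;
};

class StmtEmitter {
public:
  void EmitStmt(const Stmt &S) {
    if (wantsStopPoint(S))
      EmitStopPoint(S.Line);
    // ... code generation for S would follow here ...
  }
  const std::vector<unsigned> &stopPoints() const { return StopPoints; }

private:
  // Filter: a bare '{' is not worth stopping on.
  static bool wantsStopPoint(const Stmt &S) {
    return S.Kind != CompoundStmtKind;
  }
  // Skips the stoppoint if the previous one was on the same line.
  void EmitStopPoint(unsigned Line) {
    if (!StopPoints.empty() && StopPoints.back() == Line)
      return;
    StopPoints.push_back(Line);
  }
  std::vector<unsigned> StopPoints;
};
```

For the foo() example, the CompoundStmt on line 2 is filtered out and only line 3 gets a stoppoint; three expression statements on one line produce a single stoppoint.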

> I am thinking of going back to my earlier idea and putting it into the
> functions below:
> EmitScalarExpr
> EmitComplexExpr
> EmitAggExpr

The problem with this is that you'll get tons of duplicated stoppoints, one for every subexpression: "2*a + 1" would have 5 stoppoints, most of which would get filtered out. Are you sure about this?
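The count of 5 can be checked with a toy expression tree. The Expr type below is hypothetical and far simpler than clang's (a real AST may contain extra nodes such as implicit casts); the sketch only counts how often a hook placed in a per-expression emit function would run.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical expression node: a leaf has no children, a binary operator
// has two.
struct Expr {
  std::unique_ptr<Expr> LHS, RHS;
};

std::unique_ptr<Expr> leaf() { return std::unique_ptr<Expr>(new Expr()); }

std::unique_ptr<Expr> binary(std::unique_ptr<Expr> L,
                             std::unique_ptr<Expr> R) {
  std::unique_ptr<Expr> E(new Expr());
  E->LHS = std::move(L);
  E->RHS = std::move(R);
  return E;
}

// Counts how many times a per-expression hook would run for E: once for the
// node itself plus once for every subexpression.
unsigned countHookCalls(const Expr *E) {
  if (!E)
    return 0;
  return 1 + countHookCalls(E->LHS.get()) + countHookCalls(E->RHS.get());
}
```

Building "2*a + 1" as `binary(binary(leaf(), leaf()), leaf())` gives the nodes 2, a, 2*a, 1 and 2*a + 1, so the hook runs five times for a single source line.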

-Chris