RFC: Supporting macros in LLVM debug info

Hi,

I would like to implement macro debug info support in LLVM.

Below you will find 4 parts:

  1. Background on what it means to debug macros.

  2. A brief explanation on how to represent macro debug info in DWARF 4.0.

  3. The suggested design.

  4. A full example: Source → AST → LLVM IR → DWARF.

Feel free to skip the first two parts if you think you know the background.

Please let me know if you have any comments or feedback on this approach.

Thanks,

Amjad

[Background]

There are two kinds of macro definition:

  1. Simple macro definition, e.g. #define M1 Value1

  2. Function macro definition, e.g. #define M2(x, y) (x) + (y)

Macro scope starts with the “#define” directive and ends with the “#undef” directive.

GDB supports debugging macros. This means it can evaluate expressions that use any macro whose scope covers the current breakpoint location.

For example:

GDB command: print M2(3, 5)

GDB Result: 8
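
As a small illustration of macro scope and of what the debugger has to reproduce, here is a minimal sketch. The function names are made up for the example. Macro evaluation is textual substitution, so an expression such as 2 * M2(3, 5) expands to 2 * (3) + (5).

```cpp
#include <cassert>

#define M2(x, y) (x) + (y)   // the scope of M2 starts here

// Expands to (3) + (5).
int sumViaMacro() { return M2(3, 5); }

// Expands to 2 * (3) + (5); the macro body has no outer parentheses.
int doubledViaMacro() { return 2 * M2(3, 5); }

#undef M2                    // the scope of M2 ends here
```

A breakpoint inside either function falls within the scope of M2, so that is where a debugger with macro information could evaluate the same expressions.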

GDB can evaluate the macro expression based on the “.debug_macinfo” section (DWARF 4.0).

[DWARF 4.0 “.debug_macinfo” section]

In this section there are 4 kinds of entries:

  1. DW_MACINFO_define

  2. DW_MACINFO_undef

  3. DW_MACINFO_start_file

  4. DW_MACINFO_end_file

Note: There is a 5th kind of entry for vendor specific macro information, that we do not need to support.

The first two entries contain the line number where the macro is defined/undefined and a null-terminated string holding the macro name (followed by the replacement value for a definition, or by the parameter list and then the replacement value for a function macro definition).

The third entry contains the line number where the file was included, followed by the file id (an index into the file table in the debug line section).

The fourth entry contains nothing; it just closes the preceding entry of the third kind (start_file).

Macro definition and file inclusion entries must appear in the same order as they appear in the source file. All macro entries between a “start_file” entry and its “end_file” entry represent macros that appear directly or indirectly in the included file.

Special cases:

  1. The main source file should be the first “start_file” entry in the sequence, and should have line number “0”.

  2. Command line/Compiler definitions must also have line number “0” but must appear before the first “start_file” entry.

  3. Command line include files must also have line number “0” but will appear straight after the “start_file” of the main source.
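
The entry layout and ordering rules above can be sketched as a small encoder. This is only an illustration, not LLVM's emitter. The type codes 0x01 to 0x04 and the ULEB128 integer encoding come from DWARF 4. The helper names, the file index 1 for the main file, and the example unit (a command-line definition of NDEBUG plus one definition at line 3) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Entry type codes as given in the DWARF 4 specification.
enum : uint8_t {
  DW_MACINFO_define = 0x01,
  DW_MACINFO_undef = 0x02,
  DW_MACINFO_start_file = 0x03,
  DW_MACINFO_end_file = 0x04,
};

// Append an unsigned LEB128 value (7 payload bits per byte, high bit = "more").
void appendULEB128(std::vector<uint8_t> &Out, uint64_t Value) {
  do {
    uint8_t Byte = Value & 0x7f;
    Value >>= 7;
    if (Value != 0)
      Byte |= 0x80;
    Out.push_back(Byte);
  } while (Value != 0);
}

// define/undef entry: type code, line number, null-terminated string.
void appendMacro(std::vector<uint8_t> &Out, uint8_t Type, uint64_t Line,
                 const std::string &Text) {
  Out.push_back(Type);
  appendULEB128(Out, Line);
  Out.insert(Out.end(), Text.begin(), Text.end());
  Out.push_back(0);
}

// start_file entry: type code, line of the #include, file index.
void appendStartFile(std::vector<uint8_t> &Out, uint64_t Line,
                     uint64_t FileIndex) {
  Out.push_back(DW_MACINFO_start_file);
  appendULEB128(Out, Line);
  appendULEB128(Out, FileIndex);
}

// end_file entry: just the type code.
void appendEndFile(std::vector<uint8_t> &Out) {
  Out.push_back(DW_MACINFO_end_file);
}

// Hypothetical unit: a command-line definition first (line 0, before any
// start_file), then the main file (line 0, assumed file index 1) with one
// definition at line 3.
std::vector<uint8_t> encodeExampleUnit() {
  std::vector<uint8_t> Out;
  appendMacro(Out, DW_MACINFO_define, 0, "NDEBUG 1");
  appendStartFile(Out, 0, 1);
  appendMacro(Out, DW_MACINFO_define, 3, "M1 Value1");
  appendEndFile(Out);
  Out.push_back(0); // an entry with type code 0 terminates the unit's list
  return Out;
}
```

The ordering in encodeExampleUnit follows special cases 1 and 2: the command-line definition comes before the first start_file entry, and both use line number 0.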

[Design]

To support macros the following components need to be modified: Clang, LLVM IR, and the DWARF debug info emitter.

In clang, we need to handle these source directives:

  1. #define

  2. #undef

  3. #include

The idea is to make use of the “PPCallbacks” class, which allows the preprocessor to notify the parser each time one of the above directives occurs.

These are the callbacks that should be implemented:

“MacroDefined”, “MacroUndefined”, “FileChanged”, and “InclusionDirective”.

AST will be extended to support two new DECL types: “MacroDecl” and “FileIncludeDecl”.

A “FileIncludeDecl” node may contain other “FileIncludeDecl”/“MacroDecl” nodes.

These two new AST DECLs are not part of TranslationUnitDecl and are handled separately (see AST example below).

In the LLVM IR, metadata debug info will be extended to support new DIs as well:

“DIMacro”, “DIFileInclude”, and “MacroNode”.

The last is needed because we cannot use DINode as a base class of the “DIMacro” and “DIFileInclude” nodes.

DIMacro will contain:

· type (definition/undefinition).

· line number (integer).

· name (null terminated string).

· replacement value (null terminated string - optional).

DIFileMacro will contain:

· line number (integer).

· file (DIFile).

· macro list (MacroNodeArray) - optional.

In addition, the DICompileUnit will contain a new optional field of macro list of type (MacroNodeArray).
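
To make the children-style layout concrete, here is a rough sketch of such a tree and the depth-first walk an emitter would do over it. The types MacroNode, Macro and MacroFile are plain C++ stand-ins for the proposed nodes, not LLVM metadata classes. The file names and the textual output lines are invented for the example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the proposed common base of macro nodes.
struct MacroNode {
  virtual ~MacroNode() = default;
  virtual void emit(std::vector<std::string> &Out) const = 0;
};

// Stand-in for DIMacro: type, line number, name, optional replacement value.
struct Macro : MacroNode {
  bool IsDefine;
  unsigned Line;
  std::string Name, Value;
  Macro(bool IsDefine, unsigned Line, std::string Name, std::string Value = "")
      : IsDefine(IsDefine), Line(Line), Name(std::move(Name)),
        Value(std::move(Value)) {}
  void emit(std::vector<std::string> &Out) const override {
    std::string Text = std::string(IsDefine ? "define" : "undef") + " " +
                       std::to_string(Line) + " " + Name;
    if (!Value.empty())
      Text += " " + Value;
    Out.push_back(Text);
  }
};

// Stand-in for the file node: line number, file, optional macro list.
struct MacroFile : MacroNode {
  unsigned Line;
  std::string File;
  std::vector<std::unique_ptr<MacroNode>> Children;
  MacroFile(unsigned Line, std::string File)
      : Line(Line), File(std::move(File)) {}
  void emit(std::vector<std::string> &Out) const override {
    Out.push_back("start_file " + std::to_string(Line) + " " + File);
    for (const auto &Child : Children) // children are kept in source order
      Child->emit(Out);
    Out.push_back("end_file");
  }
};

// Builds a hypothetical compile-unit macro list and walks it depth first.
std::vector<std::string> emitChildrenExample() {
  std::vector<std::unique_ptr<MacroNode>> CUMacros;
  CUMacros.push_back(std::make_unique<Macro>(true, 0, "NDEBUG", "1"));
  auto Main = std::make_unique<MacroFile>(0, "main.c");
  auto Header = std::make_unique<MacroFile>(1, "util.h");
  Header->Children.push_back(std::make_unique<Macro>(true, 2, "M1", "Value1"));
  Main->Children.push_back(std::move(Header));
  Main->Children.push_back(std::make_unique<Macro>(false, 5, "M1"));
  CUMacros.push_back(std::move(Main));

  std::vector<std::string> Out;
  for (const auto &Node : CUMacros)
    Node->emit(Out);
  return Out;
}
```

With this shape the emission order falls out of the nesting: each file node brackets its children between a start_file and an end_file line.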

Finally, I assume that macro support should be disabled by default, and there should be a flag to enable this feature. I would say that we should introduce a new specific flag, e.g. “-gmacro”, that could be used with “-g”.

[Example]

Here is an example that demonstrates the macro support from Source->AST->LLVM IR->DWARF.

Source


Thanks for looking into that!
Without having read the remainder of proposal yet — are you aware that the DWARF4 macro support is being replaced with a more efficient representation in DWARF5?

http://dwarfstd.org/ShowIssue.php?issue=110722.1

It may make sense to either only implement the new version or design it in a way to support both formats.
-- adrian

Keep forgetting I have to explicitly send to the list...

From: llvm-dev [mailto:llvm-dev-bounces@lists.llvm.org] On Behalf Of
Adrian Prantl via llvm-dev
Sent: Wednesday, October 28, 2015 10:37 AM
To: Aboud, Amjad via llvm-dev
Cc: llvm-dev@lists.llvm.org
Subject: Re: [llvm-dev] RFC: Supporting macros in LLVM debug info


Unless you're going to do "macro units" I think the two formats are
basically equivalent, differing only in the details of how the strings
are found. So, I'd expect the AST and metadata parts to work fine either
way, and only the details of the final object-file sections would vary.
--paulr

Hi Adrian
I am aware of DWARF 5.0, and would like to support it once it is official.
However, my impression was that the only difference in Design to support DWARF 5.0 would be in debug info emitter and dwarfdump components.
So, I agree with Paul on that.

Please, let me know if I am missing anything.

Thanks,
Amjad

> AST will be extended to support two new DECL types: "MacroDecl" and
> "FileIncludeDecl".

Do we really need to touch the AST? Or would it be reasonable to wire up
the CGDebugInfo directly to the PPCallbacks, if it isn't already? (perhaps
it is already wired up for other reasons?)


> DIFileMacro will contain:
>
> · line number (integer).
>
> · file (DIFile).
>
> · macro list (MacroNodeArray) - optional.

I wonder if it'd be better to use a parent chain style approach (DIMacro
has a DIMacroFile it refers to, each DIMacroFile has another one that it
refers to, up to null)?
(does it ever make sense/need to have a DIMacroFile without any macros in
it? I assume not?)

Might be good to start with dwarfdump support - seems useful regardless of
anything else?

> Do we really need to touch the AST? Or would it be reasonable to wire up the CGDebugInfo directly to the PPCallbacks, if it isn’t already? (perhaps it is already wired up for other reasons?)

This sounds like a good idea; I will check that approach.

PPCallbacks is only an interface with nothing connected to it, but we will create a new class that implements PPCallbacks for macros, so we can connect whatever we want to that class.

The only drawback of this approach is that we can only test the frontend through the generated LLVM IR, i.e. the whole path, instead of having two tests: an AST test for the parser and an LLVM IR test for the Sema.

> I wonder if it’d be better to use a parent chain style approach (DIMacro has a DIMacroFile it refers to, each DIMacroFile has another one that it refers to, up to null)?
> (does it ever make sense/need to have a DIMacroFile without any macros in it? I assume not?)

First, it seems that GCC does emit a MacroFile that has no macros inside (I understand that it might not be useful, but I am not sure whether we should ignore that or not).

Second, I assume that you are suggesting the parent chain style instead of the current children style, right?

In that case, won’t it make the debug emitter code much more complicated, since it has to figure out the DFS tree that should be emitted for the macros, not to mention the macro order, which will be lost?

Also, remember that the command line macros have no DIMacroFile parent.

However, if you meant to use the parent chain in addition to the children list, then what extra information will it give us?

> Might be good to start with dwarfdump support - seems useful regardless of anything else?

I agree, and in fact I already have this code implemented; I will upload it for review soon.

Thanks,

Amjad

>> Do we really need to touch the AST? Or would it be reasonable to wire up
>> the CGDebugInfo directly to the PPCallbacks, if it isn't already? (perhaps
>> it is already wired up for other reasons?)
>
> This sounds like a good idea; I will check that approach.
>
> PPCallbacks is only an interface with nothing connected to it, but we will
> create a new class that implements PPCallbacks for macros.

Right - I was wondering if CGDebugInfo already implemented PPCallbacks or
was otherwise being notified of PPCallback related things, possibly through
a layer or two of indirection.

> So we can connect whatever we want to that class.
>
> The only drawback of this approach is that we can only test the frontend
> through the generated LLVM IR, i.e. the whole path, instead of having two
> tests: an AST test for the parser and an LLVM IR test for the Sema.

We don't usually do direct AST tests in Clang for debug info (or for many
things, really) - we just do source -> llvm IR anyway, so that's nothing
out of the ordinary.

>> I wonder if it'd be better to use a parent chain style approach (DIMacro
>> has a DIMacroFile it refers to, each DIMacroFile has another one that it
>> refers to, up to null)?
>> (does it ever make sense/need to have a DIMacroFile without any macros
>> in it? I assume not?)
>
> First, it seems that GCC does emit a MacroFile that has no macros inside (I
> understand that it might not be useful, but I am not sure whether we should
> ignore that or not).

Yeah, that's weird - I'd sort of be inclined to skip it until we know what
it's useful for.

> Second, I assume that you are suggesting the parent chain style instead of
> the current children style, right?

Correct

> In that case, won’t it make the debug emitter code much more complicated,
> since it has to figure out the DFS tree,

I don't quite imagine it would be more complicated - we would just be
building the file parent chain as we go, and keeping the current macro file
around to be used as the parent to any macros we create.

> which should be emitted for the macros, not to mention the macro order,
> which will be lost?

Not necessarily, if we kept the macros in order in the list of macros
attached to the CU, which I imagine we would.

> Also, remember that the command line macros have no DIMacroFile parent.

Fair - they could have the null parent, potentially.

> However, if you meant to use the parent chain in addition to the children
> list, then what extra information will it give us?

>> Might be good to start with dwarfdump support - seems useful regardless
>> of anything else?
>
> I agree, and in fact I already have this code implemented; I will upload it
> for review soon.

Cool

> Not necessarily, if we kept the macros in order in the list of macros attached to the CU, which I imagine we would.

OK, now I understand what you are aiming for. I really do not favor one over the other.

But can you explain what the advantage of the parent approach over the children approach is?

If anything, the children approach seems to be the one that reduces the LLVM IR size, doesn't it?

Regards,

Amjad

>> Not necessarily, if we kept the macros in order in the list of macros
>> attached to the CU, which I imagine we would.
>
> OK, now I understand what you are aiming for. I really do not favor one over
> the other.
>
> But can you explain what the advantage of the parent approach over the
> children approach is?

Not too much in it, really. The only thing I'd wonder about is whether the
parent approach would work better for LTO or not. Bit of a toss-up perhaps.
If each file generally produces the same macros (ie: no per-file macro
weirdness causing different sets of macros to come out of the same file)
then a parent->child structure should deduplicate fine under LTO, I think.

How is the macinfo referenced by the rest of the debug info?

> If anything, the children approach seems to be the one that reduces the LLVM IR
> size, doesn't it?

Hmm... yes, good point. I suppose it would involve twice as many pointers
to the relevant nodes (the child nodes move from the parent's child list to
the child's parent pointer, a zero-cost change, but then you add another
pointer to each child from the primary list)

OK - yeah, I'm fine with a top down design as you have it. (just took me a
little while to think through - since most of our structures are bottom up
to allow new things to be added later/merged during LTO, etc, but that
should be relatively uncommon in this case since we'll be emitting /all/
the macros in a given file (that are enabled, and differently enabled
features in the same program in different files should be relatively
uncommon))
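
For comparison, the parent-chain alternative discussed above can be sketched as follows: the CU keeps one flat, ordered list of macros, each macro points at its file node (null for command-line macros), and each file node points at its includer. FileNode, MacroEntry and the emitter function are hypothetical names for this sketch. The emitter recovers the start_file/end_file nesting by comparing each macro's chain with the files currently open.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical file node: where it was included and who included it.
struct FileNode {
  unsigned Line;
  std::string File;
  const FileNode *Parent; // null for the main source file
};

// Hypothetical macro entry in the CU's ordered list.
struct MacroEntry {
  std::string Text;     // e.g. "define 2 M1 Value1"
  const FileNode *File; // null for command-line macros
};

std::vector<std::string>
emitFromParentChains(const std::vector<MacroEntry> &Macros) {
  std::vector<std::string> Out;
  std::vector<const FileNode *> Open; // open files, outermost first

  auto SwitchTo = [&](const FileNode *Target) {
    // Build the chain main file -> ... -> Target.
    std::vector<const FileNode *> Chain;
    for (const FileNode *F = Target; F; F = F->Parent)
      Chain.push_back(F);
    std::reverse(Chain.begin(), Chain.end());

    // Keep the common prefix, close what no longer applies, open the rest.
    std::size_t Common = 0;
    while (Common < Open.size() && Common < Chain.size() &&
           Open[Common] == Chain[Common])
      ++Common;
    while (Open.size() > Common) {
      Out.push_back("end_file");
      Open.pop_back();
    }
    for (std::size_t I = Common; I < Chain.size(); ++I) {
      Out.push_back("start_file " + std::to_string(Chain[I]->Line) + " " +
                    Chain[I]->File);
      Open.push_back(Chain[I]);
    }
  };

  for (const MacroEntry &M : Macros) {
    SwitchTo(M.File);
    Out.push_back(M.Text);
  }
  SwitchTo(nullptr); // close every file still open
  return Out;
}

std::vector<std::string> emitParentChainExample() {
  FileNode Main{0, "main.c", nullptr};
  FileNode Header{1, "util.h", &Main};
  std::vector<MacroEntry> Macros = {
      {"define 0 NDEBUG 1", nullptr},
      {"define 2 M1 Value1", &Header},
      {"undef 5 M1", &Main},
  };
  return emitFromParentChains(Macros);
}
```

Two consequences of this shape show up in the sketch: a file that contains no macros never produces an entry, and every macro carries an extra pointer in addition to its slot in the CU list, which is the size cost mentioned above.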

I’d like to jump in. I do not work on a preprocessor based language, but have the same code expansion problem to encode. Right now, we hack around the problem by appending some prefix after the file name and pretend it is a different file, which is not great.

I understand you want to represent expansion by DIFileMacro ? I’m not sure how this is supposed to be used and it is not in the example. Also, could we use DIExpansion or some other non macro specific name ?

Thanks,

Amaury SECHET

> I'd like to jump in. I do not work on a preprocessor based language, but
> have the same code expansion problem to encode.

Not quite sure whether this proposal will be relevant/help you.

> Right now, we hack around the problem by appending some prefix after the
> file name and pretend it is a different file, which is not great.
>
> I understand you want to represent expansion by DIFileMacro ?

Not quite - the purpose of the DWARF macro feature is to describe the
original preprocessor macros (not their application) so that a user in a
debugger can use those macros when evaluating an expression. Some codebases
are heavily macro dependent, so any user wanting to probe/interact with
those codebases in their debugger needs the debugger to be able to evaluate
expressions that use macros.

> I'm not sure how this is supposed to be used and it is not in the example.
> Also, could we use DIExpansion or some other non macro specific name ?

Probably not useful/relevant, since the DWARF feature is specifically for
describing macros. Even if it can be used to describe other things (& I'm
not sure whether the things you're dealing with would map onto this feature)
we might as well keep with DWARF's taxonomy rather than introducing another
that then has to be mapped into DWARF's taxonomy anyway.

- David

I think you misunderstood. I obviously have no intention to use the macro def/undef part of things. I was more wondering about a DIFileMacro or alike to signify expansion in DILocation rather than regular. Is that in or out of that discussion ?

I /think/ that's probably out of scope, but I'm clearly missing some bits
of context in any case. Perhaps we could have a separate thread about that,
if you're interested in how one might go about supporting what you're
dealing with.

> Right - I was wondering if CGDebugInfo already implemented PPCallbacks or was otherwise being notified of PPCallback related things, possibly through a layer or two of indirection.

I looked into skipping the AST representation of macros and communicating them directly from the Parser to CGDebugInfo.

However, I could not find a way to set up this communication.

The only interfaces available through the Parser are Sema (to create an AST) and ASTConsumer, while CGDebugInfo is only available in the CodeGenModule, which is accessible from BackendConsumer, which implements ASTConsumer.

David, skipping the AST will save a lot of code, but I need help figuring out how to communicate with the CGDebugInfo.

Thanks,

Amjad

I found a way to skip representing macros in the AST and create them directly in CGDebugInfo through PPCallbacks during preprocessing.

To do that, I needed to extend the ASTConsumer interface with this extra method:

  /// If the consumer is interested in notifications from Preprocessor,
  /// for example: notifications on macro definitions, etc., it should return
  /// a pointer to a PPCallbacks here.
  /// The caller takes ownership of the returned pointer.
  virtual PPCallbacks *CreatePreprocessorCallbacks() { return nullptr; }

Then ParseAST can use it to add the preprocessor callbacks needed by the AST consumer to the preprocessor:

  S.getPreprocessor().addPPCallbacks(
      std::unique_ptr<PPCallbacks>(Consumer->CreatePreprocessorCallbacks()));

With this approach, the change in clang to support macros is very small.

Do you agree with this approach?

Thanks,

Amjad


(CreatePreprocessorCallbacks, if that's the path we take, should return a
unique_ptr itself rather than returning a raw ownership-passing pointer,
but that's a minor API detail)
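
A minimal sketch of that API detail, with the hook returning a unique_ptr so the ownership transfer is visible in the type. The classes below are small stand-ins that borrow names from the thread; they are not the Clang declarations, and MacroRecordingConsumer, Recorder and the simplified MacroDefined signature are made up for the example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in callback interface (simplified; NOT clang::PPCallbacks).
struct PPCallbacks {
  virtual ~PPCallbacks() = default;
  virtual void MacroDefined(const std::string &, unsigned) {}
};

// Stand-in consumer interface with the proposed hook.
struct ASTConsumer {
  virtual ~ASTConsumer() = default;
  // Returning unique_ptr makes "the caller takes ownership" part of the type.
  virtual std::unique_ptr<PPCallbacks> CreatePreprocessorCallbacks() {
    return nullptr;
  }
};

// Stand-in preprocessor that owns its callbacks and forwards events to them.
struct Preprocessor {
  std::vector<std::unique_ptr<PPCallbacks>> Callbacks;
  void addPPCallbacks(std::unique_ptr<PPCallbacks> C) {
    if (C) // consumers that are not interested return null
      Callbacks.push_back(std::move(C));
  }
  void define(const std::string &Name, unsigned Line) {
    for (auto &C : Callbacks)
      C->MacroDefined(Name, Line);
  }
};

// Hypothetical consumer that records macro definitions as "name@line".
struct MacroRecordingConsumer : ASTConsumer {
  std::vector<std::string> Seen;

  struct Recorder : PPCallbacks {
    std::vector<std::string> &Seen;
    explicit Recorder(std::vector<std::string> &Seen) : Seen(Seen) {}
    void MacroDefined(const std::string &Name, unsigned Line) override {
      Seen.push_back(Name + "@" + std::to_string(Line));
    }
  };

  std::unique_ptr<PPCallbacks> CreatePreprocessorCallbacks() override {
    return std::make_unique<Recorder>(Seen);
  }
};

std::vector<std::string> runHookExample() {
  MacroRecordingConsumer Consumer;
  Preprocessor PP;
  // No raw pointer is wrapped at the call site any more.
  PP.addPPCallbacks(Consumer.CreatePreprocessorCallbacks());
  PP.define("M1", 3);
  return Consumer.Seen;
}
```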

> With this approach, the change in clang to support macros is very small.
>
> Do you agree with this approach?

Richard - what do you reckon's the right hook/path to get preprocessor info
through to codegen (& CGDebugInfo in particular). Would a general purpose
hook in the ASTConsumer be appropriate/useful?


ASTConsumer shouldn't know anything about the preprocessor; there's no
reason to think, in general, that the AST is being produced by
preprocessing and parsing some text. Perhaps adding a PreprocessorConsumer
interface akin to the existing SemaConsumer interface would be a better way
to go.

Thanks,


Hmm, I suppose a fair question then - would it be possible to implement
debug info for macros when reading ASTs from a serialized file (without a
preprocessor). Or should we actually go with the original proposal of
creating AST nodes for the preprocessor events so we can have access to
them after loading serialized modules & then generating debug info from
them? Is there some other side table we'd be better off using/creating for
this task?


It would make sense to split the preprocessor into separate layers for
holding the macro / other state information and for actually performing
preprocessing (that is, we'd have a separate "preprocessor AST" containing
just the macro information), similar to the AST / Sema split, but that's a
rather large task. In the meantime, we would need to require people to set
up a preprocessor to deserialize into (even though they're never going to
actually preprocess anything) so that they have somewhere to put the macros
before feeding them to CodeGen. That doesn't seem like a huge imposition.

But the case I was thinking of wasn't actually deserialized ASTs (for which
there usually is some preprocessor state available somewhere), it's cases
like lldb, swig-like tools or clang plugins that synthesize AST nodes out
of thin air. CodeGen should be prepared to generate code from a world where
no preprocessor ever existed, and we shouldn't make the ASTConsumer
interface imply that macros are part of the AST -- we should present them
as an (optional) separate layer.
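
One possible reading of the "optional separate layer" idea, sketched with stand-in types: macro information reaches only those consumers that opt in through a separate interface, and the pipeline still works when no preprocessor state exists at all. Every name here (MacroInfoSource, PreprocessorConsumer, InitializePreprocessor, CodeGenConsumer, runPipeline) is an assumption for illustration and does not describe existing Clang interfaces. The sketch uses dynamic_cast for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder of macro state (filled by a preprocessor or by
// deserialization; absent for synthesized ASTs).
struct MacroInfoSource {
  std::vector<std::string> Definitions;
};

// Stand-in consumer interface that knows nothing about macros.
struct ASTConsumer {
  virtual ~ASTConsumer() = default;
  virtual void HandleTranslationUnit() {}
};

// Hypothetical optional layer; only interested consumers implement it.
struct PreprocessorConsumer {
  virtual ~PreprocessorConsumer() = default;
  virtual void InitializePreprocessor(const MacroInfoSource &Macros) = 0;
};

// A code generator that can use macro info when it is offered.
struct CodeGenConsumer : ASTConsumer, PreprocessorConsumer {
  std::size_t MacroCount = 0;
  bool Emitted = false;
  void InitializePreprocessor(const MacroInfoSource &Macros) override {
    MacroCount = Macros.Definitions.size();
  }
  void HandleTranslationUnit() override { Emitted = true; }
};

// Driver: Macros is null when no preprocessor ever existed.
void runPipeline(ASTConsumer &Consumer, const MacroInfoSource *Macros) {
  if (Macros)
    if (auto *PC = dynamic_cast<PreprocessorConsumer *>(&Consumer))
      PC->InitializePreprocessor(*Macros);
  Consumer.HandleTranslationUnit();
}
```

In this sketch, code generation runs the same way with or without macro state; the macro layer only adds information when a source for it exists.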



Maybe it's just the week I've had (& perhaps Amjad can make more sense of
it) but I'm having a hard time picturing what you're suggesting.

You're saying currently when loading modules (which do have macros & such
in them, so there's some "preprocessor-y" things going on) we do
<something> but instead/in addition we could build a Preprocessor and
populate it (it doesn't have any representation for this currently? we'd
have to add a side table in Preprocessor for these reconstituted macro
things?) from the module - then, separately, decide how the information
gets from the Preprocessor to CodeGen?

> But the case I was thinking of wasn't actually deserialized ASTs (for
> which there usually is some preprocessor state available somewhere), it's
> cases like lldb, swig-like tools or clang plugins that synthesize AST nodes
> out of thin air. CodeGen should be prepared to generate code from a world
> where no preprocessor ever existed, and we shouldn't make the ASTConsumer
> interface imply that macros are part of the AST -- we should present them
> as an (optional) separate layer.

OK - any ideas/suggestions/preferences on how we get the stuff in
Preprocessor into CodeGen/CGDebugInfo? I'm just not quite picturing how
this all lines up, but haven't looked at the boundaries in much detail/know
them well.

Thanks a bunch,
- Dave