[PATCH/DRAFT] Embed metadata into object file

Hi,

this is my first contribution to LLVM/clang, so I hope I come close
to the required coding standards and guidelines.

First, I will describe the scenario I want to solve: for a few days now,
the clang plugin interface has allowed a plugin to be executed just
before the actual main action (e.g., compiling a translation unit). In my
case, the plugin we're developing will analyze the AST and generate some
information. This information should be available in the generated
object file. So I had to answer the question: how do I smuggle some
metadata on the module level from the frontend to the generated object
file (in a hopefully sane and reusable manner)?

The solution I came up with is the following. The gathered
information is put into a NamedMetadataStringAttr and attached to the
TranslationUnitDecl. This would be, as far as I can see, the first
annotation on the TranslationUnit level. Attributes seem to me the best
solution at that point, since there is already a good infrastructure for
them in clang.

   TranslationUnitDecl
   > NamedMetadataAttr implicit llvm.extra_section clang.analysis "Information"

The clang CodeGen then generates a named metadata node on the
llvm::Module level:

   !llvm.extra_sections = {!0}
   !0 = !{!"clang.analysis", !"Information"}

The attached patch then takes all key-value pairs from the
llvm.extra_sections named metadata and appends them as zero-terminated
strings to the desired section.
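
As a rough illustration of the "zero-terminated strings appended to a section" idea, here is a minimal stand-alone sketch. It does not use the LLVM API and is not the code from the attached patch; `ExtraSectionEntry` and `buildSectionPayload` are hypothetical names made up for this example, and the assumption is that each entry is simply a (section name, string) pair.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one operand of !llvm.extra_sections:
// the target section name plus the string that should end up in it.
struct ExtraSectionEntry {
    std::string section;
    std::string data;
};

// Build the raw bytes for one section: every string destined for that
// section is appended, followed by a single '\0' terminator.
std::string buildSectionPayload(const std::vector<ExtraSectionEntry>& entries,
                                const std::string& section) {
    std::string payload;
    for (const auto& entry : entries) {
        if (entry.section != section)
            continue;  // belongs to a different section
        payload += entry.data;
        payload.push_back('\0');
    }
    return payload;
}
```

A consumer reading the section back can then split the payload at the NUL bytes to recover the individual strings.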

I'm not sure whether this is the best solution to the problem, and
whether it is general enough. I was also thinking about the possibility
of attaching LLVM passes from the plugin interface to the clang/CodeGen
backend, but I could not figure out how to do this without breaking all
the abstractions involved.

I am posting this to llvm-dev as well as cfe-dev, since both changes,
although independent of each other, relate to one another.

chris

0001-Attach-LLVM-metadata-with-NamedMetadataStringAttr.patch (3.34 KB)

0001-Use-llvm.extra_section-to-smuggle-data-into-object-s.patch (9.19 KB)

Depending on your needs, just using a global with the “section” attribute might also work for you:
http://llvm.org/docs/LangRef.html#global-variables
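
A minimal sketch of what this suggestion could look like at the source level, assuming a GCC/clang-style `section` attribute and an ELF target. The section name `clang.analysis` is borrowed from the example earlier in the thread, and `analysis_info` and `ANALYSIS_SECTION` are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

// The section attribute is a GCC/clang extension. A bare section name like
// this one is only valid on ELF targets (Mach-O expects "segment,section"),
// so other targets fall back to an ordinary global in this sketch.
#if defined(__ELF__) && (defined(__GNUC__) || defined(__clang__))
#define ANALYSIS_SECTION __attribute__((section("clang.analysis"), used))
#else
#define ANALYSIS_SECTION
#endif

// Hypothetical payload: the string that should be findable in the object file.
// The extern declaration gives the const array external linkage.
extern const char analysis_info[];
const char analysis_info[] ANALYSIS_SECTION = "Information";
```

On an ELF system, `readelf -S` on the resulting object file should list the extra section. Note that this changes the module's actual contents by adding a real global.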

-- adrian

Adrian Prantl <aprantl@apple.com> writes:

Depending on your needs, just using a global with the “section”
attribute might also work for you:
http://llvm.org/docs/LangRef.html#global-variables

I was aware of that possibility. But, there are several drawbacks to
using global variables and the current infrastructure from a clang
plugin's point of view:

1. It feels like a hack. It is not an idiomatic way of transporting
   information alongside a translation unit.

2. A clang plugin is mostly defined by its ASTConsumer; there is no
   direct access to the produced LLVM intermediate representation. I
   would have to insert AST elements that result in the attributed
   global variable. [1]

3. A more idealistic problem I have with this solution is that we change
   the actual (semantic) content of the module. But I only want to carry
   _metadata_ about the module alongside.

4. There already is a good metadata infrastructure in LLVM and a very
   good attribute infrastructure in clang. Why not utilize them?

So my question is:

   I want to generate and attach (structured) metadata at every point in
   the life cycle (AST, LLVM module, object file) of a module and retrieve
   it from the compilation results. How can I do that?

At the moment, my patches are only a proposal to solve this for a very
specific use case (mine): attaching single strings. Furthermore, I think
it would also be handy to materialize and retrieve more complex
metadata types to/from object files. For example, we could serialize
LLVM metadata trees into separate sections:

   !llvm.extra_sections = {!0}
   !0 = !{!"clang.myanalysis", !1, !2}
   !1 = !{ i32 15, i32 28, i32 142}
   !2 = !{!"key", i32 10000, !"key2", i32 9999}

I'm not sure whether this would be useful and reusable for others. So:
Do you think this would be useful for other developers as well?

chris

[1] This is a _separate_ drawback of clang plugins at the moment: They
    cannot define an LLVM IR transformation that should be applied when
    the plugin is active. I think it would be useful for many analyses
    to have access both to the AST and to the IR.

So my question is:

  I want to generate and attach (structured) metadata at every point in
  the life cycle (AST, LLVM module, object file) of a module and retrieve
  it from the compilation results. How can I do that?

Worded like that, I can see a close analogy with "Debug Info".

[1] This is a _separate_ drawback of clang plugins at the moment: They
   cannot define an LLVM IR transformation that should be applied when
   the plugin is active.

You can write clang plugins that are LLVM passes and insert them into the pipeline.

I think it would be useful for many analyses
   to have access both to the AST and to the IR.

This is fuzzier to me. I don't know enough about clang, but I'm not sure the design allows keeping a link from the IR to the clang AST. (If it does, I'd be curious to see how it works.)

Best,

Mehdi Amini <mehdi.amini@apple.com> writes:

Worded like that, I can see a close analogy with "Debug Info".

Indeed, it is very similar, but there are some differences and
shortcomings if a developer only wants to smuggle some metadata out in a
very specific format:

For the IR->ELF path, the debug information is encoded as DWARF (or
  something else) in the binary. The plugin developer does not have much
  control over the binary format of the data. !llvm.extra_sections
  would give quite fine-grained control.

For the AST->IR path, I don't see an easy way to annotate a few pieces
  of information in the AST without spinning up the whole
  debug-information generation process.

[1] This is a _separate_ drawback of clang plugins at the moment: They
   cannot define an LLVM IR transformation that should be applied when
   the plugin is active.

You can write clang plugins that are LLVM passes and insert them into
the pipeline.

I can? How? I have not seen any possibility to inject an LLVM pass into
the Clang CodeGen PassManager infrastructure from a clang plugin, which
is also a FrontendAction, like CodeGen.

I think it would be useful for many analyses
   to have access both to the AST and to the IR.

This is fuzzier to me. I don't know enough about clang, but I'm not
sure the design allows keeping a link from the IR to the clang AST. (If
it does, I'd be curious to see how it works.)

If you restrict the possible subjects to top-level functions, global
variables, and the compilation unit, this should be quite
straightforward to implement.

chris

Yes, I meant conceptually it is the same as debug info, so it makes sense to solve it the same way (i.e. metadata + backend support for codegen).

[1] This is a _separate_ drawback of clang plugins at the moment: They
  cannot define an LLVM IR transformation that should be applied when
  the plugin is active.

You can write clang plugins that are LLVM passes and insert them into
the pipeline.

I can? How? I have not seen any possibility to inject an LLVM pass into
the Clang CodeGen PassManager infrastructure from a clang plugin, which
is also a FrontendAction, like CodeGen.

The FrontendAction is totally separated from the LLVM-pass, but I don’t see why your plugin can’t contain both.

To load an LLVM pass into the pipeline, see what Polly is doing for instance: http://polly.llvm.org/example_load_Polly_into_clang.html

Mehdi Amini <mehdi.amini@apple.com> writes:

Yes, I meant conceptually it is the same as debug info, so it makes
sense to solve it the same way (i.e. metadata + backend support for
codegen).

Ah ok, I thought you were suggesting to use the debugging
infrastructure.

Ok, then there are several directions this effort could go:

- The current patch is usable for others in its current form.
  (Transporting only strings)
- A more elaborate and diverse metadata storage format would be useful.
  I am thinking of something like:

  !llvm.extra_sections = {!0, !1, !2}
  !0 = !StringSection(section=!"clang.strings", data=!"My Analysis Backend")
  !1 = !MetadataSection(section=!"clang.metadata", data=!4)
  !2 = !JSONSection(section=!"clang.json", data=!4)

  !4 = !{!"fooo", i32 123, i8 255, !5}
  !5 = !{!"Barfoo"}

  In this fictitious example,
    - clang.strings would be a metadata string section, and the most
      flexible possibility.
    - clang.metadata would contain a materialized bitcode stream which
      contains only metadata referenced by !4
    - clang.json would contain a JSON string, derived from !4

  This would only be the LLVM side. What would the path from AST->IR
  look like? For !StringSection, I already drafted a possibility.

  For !MetadataSection and !JSONSection, it would be necessary to attach
  more complex structures to the AST. I currently don't know how we
  could attach these structures. An attribute with a Metadata * just
  feels bad, since it mixes LLVM and clang types.

- The third possibility is: This is all bullshit and is not needed for LLVM.
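
To make the !JSONSection idea from the second option a bit more concrete, here is a small stand-alone sketch of deriving a JSON string from a tree like !4. It is only a model: `Node`, `makeString`, `makeInt`, `makeTuple` and `toJson` are hypothetical names, the tree is a simplified stand-in rather than the llvm::Metadata class hierarchy, and mapping tuples to JSON arrays is just one conceivable encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified, hypothetical model of a metadata operand: a string,
// an integer, or a tuple of further operands.
struct Node {
    enum class Kind { String, Integer, Tuple };
    Kind kind;
    std::string text;            // used when kind == String
    long long number;            // used when kind == Integer
    std::vector<Node> operands;  // used when kind == Tuple
};

Node makeString(std::string s) { return Node{Node::Kind::String, std::move(s), 0, {}}; }
Node makeInt(long long v) { return Node{Node::Kind::Integer, "", v, {}}; }
Node makeTuple(std::vector<Node> ops) { return Node{Node::Kind::Tuple, "", 0, std::move(ops)}; }

// Quote a string for JSON. Only '"' and '\\' are escaped here; control
// characters are not handled in this sketch.
std::string quote(const std::string& s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') out.push_back('\\');
        out.push_back(c);
    }
    return out + "\"";
}

// Recursively render a node; tuples become JSON arrays.
std::string toJson(const Node& n) {
    if (n.kind == Node::Kind::String) return quote(n.text);
    if (n.kind == Node::Kind::Integer) return std::to_string(n.number);
    std::string out = "[";
    for (std::size_t i = 0; i < n.operands.size(); ++i) {
        if (i != 0) out += ",";
        out += toJson(n.operands[i]);
    }
    return out + "]";
}
```

The integer widths (i32, i8) are dropped in this encoding, which is one of the details a real format would have to decide on.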

The FrontendAction is totally separated from the LLVM-pass, but I
don’t see why your plugin can’t contain both.

To load an LLVM pass into the pipeline, see what Polly is doing for
instance: http://polly.llvm.org/example_load_Polly_into_clang.html

After looking at the code, I realized that it is possible to register
passes via llvm::RegisterStandardPasses, which is executed on plugin
load. I was not aware of this possibility. Thank you very much :)
Perhaps it would be nice to have a clang plugin example that contains
both a FrontendAction and an associated LLVM pass to demonstrate
this possibility?
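
For readers unfamiliar with this kind of load-time registration, here is a stand-alone model of the pattern. It deliberately does not use the LLVM API: `Pipeline`, `extensionRegistry`, `RegisterExtension` and `buildPipeline` are hypothetical names, and the sketch only shows how a file-scope object in a plugin can register a callback that the host later runs while building its pass pipeline.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pass pipeline; not an LLVM class.
struct Pipeline {
    std::vector<std::string> passes;
};

using ExtensionFn = std::function<void(Pipeline&)>;

// A function-local static avoids static-initialization-order problems
// between the registry and the objects that register into it.
std::vector<ExtensionFn>& extensionRegistry() {
    static std::vector<ExtensionFn> registry;
    return registry;
}

// Constructing an object of this type registers a callback.
struct RegisterExtension {
    explicit RegisterExtension(ExtensionFn fn) {
        extensionRegistry().push_back(std::move(fn));
    }
};

// In a plugin this would be a file-scope object, so its constructor runs
// when the shared library is loaded.
static RegisterExtension registerMyPass([](Pipeline& p) {
    p.passes.push_back("my-analysis-pass");
});

// What the host would do when it builds its pipeline: add its own passes,
// then give every registered extension a chance to add more.
Pipeline buildPipeline() {
    Pipeline p;
    p.passes.push_back("standard-pass");
    for (const auto& extension : extensionRegistry())
        extension(p);
    return p;
}
```

The same shared library could additionally contain the plugin's FrontendAction, since the two registrations are independent of each other.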

chris