New kind of metadata to capture LLVM IR linking structure

Hello all

llvm-link merges together the metadata from the IR files being linked together. This means that when linking different libraries together (i.e. multiple source files that have been compiled into a single LLVM IR file) it can be hard or impossible to identify the library boundaries.

We’re using LLVM to do static analysis of applications (together with their dependent libraries) and have found it useful to be able to determine which library a particular Instruction* or GlobalVariable* came from (e.g. so that we can ignore some of them, or focus analysis on particular ones).

To preserve this information across linking, I’ve implemented a new kind of metadata node MDLLVMModule that records:

  1. Which LLVM modules (i.e. LLVM IR files) have been linked into this LLVM module
  2. Which compilation units directly contribute to this LLVM module (i.e. that are not part of an LLVM submodule)

The format of the metadata looks like this:

!llvm.module = !{!0}

!0 = !MDLLVMModule(name: "test123.bc", modules: !1, cus: !24)
!1 = !{!2}
!2 = !MDLLVMModule(name: "test12.bc", cus: !3)
!3 = !{!4, !18}
!4 = !MDCompileUnit(... filename: "test1.c" ...)
!18 = !MDCompileUnit(... filename: "test2.c" ...)
!24 = !{!25}
!25 = !MDCompileUnit(... filename: "test3.c" ...)

Each linked LLVM module has the named metadata node “llvm.module” that points to its own MDLLVMModule node. In this example, the metadata is for LLVM module “test123.bc”, which is built up from linking module “test12.bc” and the compilation unit “test3.c”. Module “test12.bc” itself is built up by linking the compilation units “test1.c” and “test2.c”.
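As a rough illustration of how a tool might consume this hierarchy, here is a minimal Python sketch that mirrors the example above in memory and works out which module directly lists each compilation unit. The `ModuleNode` class and `owning_modules` function are hypothetical names for illustration only; they are not part of the patch or of any LLVM API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleNode:
    """Hypothetical in-memory mirror of one MDLLVMModule node."""
    name: str
    # Submodules that were linked into this module ("modules:" operand).
    modules: List["ModuleNode"] = field(default_factory=list)
    # Compile units contributed directly to this module ("cus:" operand).
    cus: List[str] = field(default_factory=list)


def owning_modules(root: ModuleNode) -> Dict[str, str]:
    """Map each compile unit filename to the module that lists it directly."""
    owners: Dict[str, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for cu in node.cus:
            owners[cu] = node.name
        stack.extend(node.modules)
    return owners


# The example from the post: test12.bc (test1.c, test2.c) linked with test3.c.
test12 = ModuleNode("test12.bc", cus=["test1.c", "test2.c"])
test123 = ModuleNode("test123.bc", modules=[test12], cus=["test3.c"])
owners = owning_modules(test123)
```

With this shape, `owners["test1.c"]` yields the submodule name rather than the top-level one, which is the library boundary that plain merging of compile units loses.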

The name of a module defaults to the base filename of the output file, but this can be overridden with the (also new) command-line flag -module-name to llvm-link, as in:

llvm-link -module-name=mytest -o test.bc <files>

I thought this might be useful to the wider LLVM community and would like to see this added to LLVM.

I have attached a patch that I produced against r232466. I’ve also added a corresponding DILLVMModule class.

I would be very grateful if someone could review this.

Thanks

llvm.module.metadata.patch (18.1 KB)

What's the benefit/purpose of the MDLLVMModule over just having the
MDCompileUnits themselves? I would imagine the user cares about which
source file the problem was in (obtained from the MDCompileUnit), not the
sequence of BC modules that may've been built into?

Hi David

Thanks for your email.

We envisage it to be useful when an analysis tool built using LLVM needs
to know which MDCompileUnits were part of a particular library that has
been linked in. For instance, we're currently analysing the sandboxing
behaviour within the Chromium web browser, which comprises hundreds of
internal libraries and many external ones. To be able to perform this
analysis we have to link them all together into a single .bc/.ll file.

Having the module structure allows us to model interactions between
different modules (without manually (and sometimes unreliably) having to
work out which source file corresponds to which library (e.g. libssl,
libpci, libpolicy, librenderer, etc)). It also allows an analysis tool to
support turning on/off output warnings for particular libraries (as they
can lead to a lot of analysis output).
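To make the "turning warnings on/off per library" use case concrete, here is a small Python sketch. It assumes the analysis tool already has a mapping from source file to library (however that mapping is obtained); the function name, the warning tuple format, and the file names are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical warning shape: (source file, message).
Warning = Tuple[str, str]


def filter_warnings(
    warnings: Iterable[Warning],
    cu_to_library: Dict[str, str],
    muted: Set[str],
) -> List[Warning]:
    """Drop warnings whose source file belongs to a muted library."""
    kept: List[Warning] = []
    for source, message in warnings:
        library = cu_to_library.get(source)  # None if the file is unknown
        if library in muted:
            continue
        kept.append((source, message))
    return kept


# Hypothetical file names; the library names come from the discussion.
cu_to_library = {"ssl_example.c": "libssl", "render_example.cc": "librenderer"}
warnings = [
    ("ssl_example.c", "unchecked return value"),
    ("render_example.cc", "capability leak"),
    ("unknown.c", "not attributed to any library"),
]
visible = filter_warnings(warnings, cu_to_library, muted={"libssl"})
```

Warnings from files with no known library are kept in this sketch, on the assumption that silently hiding unattributed output would be worse than showing it.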

Fair enough - I've no idea/opinion on whether that's the right abstraction
(other people with more domain knowledge of analysis infrastructure might
chime in with some thoughts).

Practically speaking: would directory paths be sufficient? The
MDCompileUnits already have information about where the source file was.

- David

I agree, this seems very weird. You have very good source location information down to directory/file/line/column for individual instructions in the existing metadata scheme, I’m not sure what this is getting you over that?

-eric

Yes, we did consider using directory paths to identify libraries, but there are cases where this doesn’t work. For example, Chromium builds a libcommon which mostly consists of source files from the folder chrome/common/…, but it also contains the file components/nacl/common/pnacl_types.cc (although other files in that folder are not part of libcommon).

Another example is libbrowser (from Chromium), which includes source files from chrome/browser but also chrome/third_party/mozilla_security_manager.

It would be nice to have a way of reliably identifying which compilation units were part of a library, which is what we were trying to achieve with MDLLVMModule, but if there are better abstractions then I am all for that.
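The failure mode described above can be sketched in a few lines of Python. The prefix table and `guess_library` are hypothetical, and the sibling file name is invented purely to show the over-matching; only the pnacl_types.cc path and the library names come from the discussion.

```python
from typing import Dict, Optional


def guess_library(path: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Return the library whose directory prefix matches path, if any."""
    for prefix, library in prefixes.items():
        if path.startswith(prefix):
            return library
    return None


prefixes = {"chrome/common/": "libcommon", "chrome/browser/": "libbrowser"}

# This file is part of libcommon, but no directory prefix claims it.
missed = guess_library("components/nacl/common/pnacl_types.cc", prefixes)

# Widening the table fixes that one file but now also claims its siblings,
# which are not part of libcommon. The sibling name here is made up.
widened = {**prefixes, "components/nacl/common/": "libcommon"}
sibling = guess_library("components/nacl/common/hypothetical_other.cc", widened)
```

Here `missed` is `None` and `sibling` is wrongly attributed to libcommon, so no choice of directory prefixes gets both files right.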

Just saying, maybe ask the build system to dump comprising files for a library?
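One way this suggestion could look on the analysis side, as a sketch only: assume the build system can be made to write a JSON manifest mapping each library to its source files. The manifest format, the `libraries_by_source` helper, and the "hypothetical_" file names are assumptions for illustration; no particular build system is known to emit exactly this.

```python
import json
from typing import Dict, List

# Hypothetical manifest as a build system might dump it.
MANIFEST = """
{
  "libcommon": ["chrome/common/hypothetical_a.cc",
                "components/nacl/common/pnacl_types.cc"],
  "libbrowser": ["chrome/browser/hypothetical_b.cc"]
}
"""


def libraries_by_source(manifest_text: str) -> Dict[str, List[str]]:
    """Invert a library -> sources manifest into source -> libraries."""
    manifest = json.loads(manifest_text)
    index: Dict[str, List[str]] = {}
    for library, sources in manifest.items():
        for source in sources:
            # A source file may be compiled into more than one library.
            index.setdefault(source, []).append(library)
    return index


index = libraries_by_source(MANIFEST)
```

The analysis tool could then look up the filename recorded in each compile unit in `index`, without any change to LLVM's metadata.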

Seems weird to me too.

Moreover, this isn't really debug info, and it's not clear that it's
generally useful, so adding first-class support for it via specialized
metadata nodes seems premature.

Hi all

I appreciate the feedback and it looks like recording the information using MD nodes may not have been the right choice.

I quite like the idea of having the build system dump the library information.

Thanks
Khilan

Just to be clear - there's a few different flavors of metadata. One of them
is explicitly typed/built in to LLVM (such as the debug info metadata
you're seeing), but there's a generic unstructured metadata you can use
that doesn't require modifications to LLVM. Though you might still need
changes to whatever bitcode linking tool you're using to insert your custom
metadata, but it shouldn't be too invasive, I'd imagine?
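As a sketch of what the generic, unstructured route might look like, the following Python helper prints textual IR for a named metadata node whose operands are plain tuples of strings (library name first, then its source files). The metadata name `mytool.libraries`, the tuple layout, and the helper itself are assumptions for illustration; a real tool would also need to pick node numbers that do not clash with the module's existing ones and escape any special characters in the strings.

```python
from typing import Dict, List


def emit_library_metadata(
    libraries: Dict[str, List[str]],
    name: str = "mytool.libraries",  # hypothetical named-metadata name
    first_id: int = 0,
) -> str:
    """Render library membership as generic (untyped) LLVM metadata text."""
    node_lines: List[str] = []
    node_refs: List[str] = []
    for offset, (library, sources) in enumerate(sorted(libraries.items())):
        node_id = first_id + offset
        # Each node is a plain tuple of metadata strings.
        operands = ", ".join('!"%s"' % text for text in [library] + sources)
        node_lines.append("!%d = !{%s}" % (node_id, operands))
        node_refs.append("!%d" % node_id)
    header = "!%s = !{%s}" % (name, ", ".join(node_refs))
    return "\n".join([header, ""] + node_lines)


text = emit_library_metadata({"libcommon": ["a.cc"]})
print(text)
```

Because this uses only generic tuples and strings, it needs no new metadata class in LLVM, though the linking tool still has to be taught to insert and merge these nodes.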

I quite like the idea of having the build system dump the library
information.

*nod* I'd certainly consider options like this.