Feature proposal: adding optional custom info to clang AST nodes.

Hello.

Our application uses clang to parse a program, build the corresponding AST, and then visit the AST to collect many kinds of information. Our visitors typically exploit functionality provided by the clang AST nodes.
However, sooner or later we face the need to collect and cache additional information that is of interest to our visitors only. Being ad hoc, this information currently cannot be stored in the clang AST nodes; on the other hand, storing it there would be maximally efficient, since we would avoid maintaining and querying (potentially huge) side data structures whose only purpose is to map a clang AST node to its additional information.

For the reasons above, we would like to propose a minor modification to the definition of the classes at the root of the clang AST hierarchy that, when enabled via a suitable configuration option, would let clients add arbitrary info to the clang AST nodes.

The basic idea is to have a new header file defining a suitable macro (e.g., CLANG_CLIENT_CUSTOM_INFO): the name of the header file will be fixed, but its content will be controlled by a clang configuration option. All the root classes for AST nodes (i.e., Decl, Stmt, Type) will include such a header file and will invoke the macro in their public section.

By default, when the customization mechanism is disabled, the macro will expand to no text at all, so as to incur zero overhead:

#define CLANG_CLIENT_CUSTOM_INFO

When customization is enabled, the macro will instead expand to whatever data the client might need; for instance:

#define CLANG_CLIENT_CUSTOM_INFO void* my_data;

Clearly, the client will be responsible for all issues related to managing any custom-allocated resources.

Does this approach sound reasonable?
Do you foresee better alternatives?

Cheers,
Enea Zaffanella.


Why not use a map that associates the original AST with 'my_data'?

snaroff

I too don't think you should go that direction. Imagine you have clang libraries on the system as shared libraries used by 20 tools. Do we want 20 copies of them, or just 1? The answer is just 1. To do that, each app that uses clang has to do so with the same unmodified clang. So, to achieve this, you'd want a hash or a map on the side and put your data in it.

Mike Stump wrote:

>> The basic idea is to have a new header file defining a suitable macro
>> (e.g., CLANG_CLIENT_CUSTOM_INFO): the name of the header file will be
>> fixed, but its content will be controlled by a clang configuration
>> option. All the root classes for AST nodes (i.e., Decl, Stmt, Type)
>> will include such a header file and will invoke the macro in their
>> public section.
>
> I too don't think you should go that direction. Imagine you have
> clang libraries on the system as shared libraries used by 20 tools.
> Do we want 20 copies of them, or just 1? The answer is just 1. To do
> that, each app that uses clang has to do so with the same unmodified
> clang. So, to achieve this, you'd want a hash or a map on the side
> and put your data in it.

It's definitely clear that a shared library version of clang should always be built with the custom data macro defined to nothing.

That said, the ability to attach arbitrary data to AST nodes (in specially compiled versions of the library) is nonetheless very useful.

Embedding the data makes the library less general, but using a map would be much less efficient.

Note that the reason for requesting this macro is that unconditionally inserting a void* into these nodes would be unacceptable for applications that do not need it.

The three vertices of the tradeoff triangle are efficiency, generality, and memory usage. By implementing the mechanism described in the proposal, we let each client decide what matters most for its own needs.

Hi Abramo,

LLVM heavily uses the "map on the side" approach. This is how optimization passes associate data with IR objects, etc. The LLVM "DenseMap" data structure is heavily optimized for this use; we don't want void*s on AST nodes.

-Chris

Chris Lattner wrote:

> We still don't want it, a DenseMap works great.
>
> -Chris

Can you quantify this for us? Build your full app using each approach, then benchmark it on some real data and let us know the % slowdown. If it is large, we might be able to come up with a better design that gets you more performance. For now, we tend to doubt that the benefit is worth the cost.

There are other ways to get at the pointers. For example, you can visit the things of interest, numbering each one sequentially as you go. Then you can use this number as an index into an array to directly access the side data: no searching, and as dense as you want. The cost: an extra parameter on the visitors, and one index for each node reference you might want to randomly walk to.

Mike Stump wrote:

>> Embedding the data makes the library less general, but using a map
>> would be much less efficient.
>
> Can you quantify this for us? Build your full app using each
> approach, then benchmark it on some real data and let us know the %
> slowdown. If it is large, we might be able to come up with a better
> design that gets you more performance. For now, we tend to doubt that
> the benefit is worth the cost.

With the attached program I've benchmarked std::map vs llvm::DenseMap vs
embedded data, reproducing our typical use.

$ time ./a.out 100000 1000 0

real 0m15.421s
user 0m15.415s
sys 0m0.005s
$ time ./a.out 100000 1000 1

real 0m1.344s
user 0m1.337s
sys 0m0.006s

$ time ./a.out 100000 1000 2

real 0m0.390s
user 0m0.387s
sys 0m0.003s

The test builds a collection of 100000 objects and links an int to each object. After that, it performs 1000 lookups per object to retrieve the data.

llvm::DenseMap is 11.5 times faster than std::map, and embedded data is 3.5 times faster than llvm::DenseMap.

The ratio between embedded data and DenseMap is not insignificant, but I confess I had guessed that it would be far higher.

Many thanks to Mike, Chris and Steve for their help and suggestions.

benchmark_maps.cc (1.6 KB)