source code database

The "open clang projects" page refers to some potential uses of clang
for tool-building. A few of them require metadata from the
lexer or parser.

I'm interested in creating a framework for searching and reporting on
large C++ code trees. I wonder what work has already been done, and if
the information I want is currently available from the clang front
end. I would begin by capturing the token metadata in SQLite, thereby
making them accessible to a variety of applications.


Back when the VAX dinosaur was knee-high to a mammal, I used DEC's
Source Code Analyzer (SCA)[1]. To this day, I have never seen or heard
of anything as good. ISTM clang could be used to create something as
good or better.

What is "as good", and what would be better?

SCA let the user:

1. analyze arbitrary subsets of a source code tree
2. dynamically restrict the range of queries on that subset
3. distinguish among read, write, invoke, reference, and dereference
4. define "interesting" cases for repeated use, including reports

  Current Tools Fail

Microsoft's tool lacks all these features. cscope has some of them,
but only for C. (For example, cscope cannot search for a
destructor or anything with a scope operator.) VS parses C++, but the
user cannot search for uses of e.g. operator<<.

The free tools I've looked at don't really parse C++. They
parse the nonlanguage "C/C++". Consequently they cannot hope to
answer #3 above; they can't even distinguish between ::B and A::B.
They also lack any kind of scripting language, preventing #4 and
severely restricting the capability of #2.

These problems are all answered by clang+SQL. Or, might be, if clang
is up to the job.

  Required Metadata

I'm sure the following is incomplete and that it is more
comprehensive than what is available from any existing tool at any
price. Is it covered by clang at present?


For any token

1. namespace
2. enclosing class/struct
3. const, static
4. linkage
5. public, protected, or private (or none)
6. declare, define, or use
7. translation unit (file) and line number

It should be possible to say in which lines of a file a given token
is visible.

For types

1. class, struct, or enum
2. derived from
3. derived how (public/protected/private)

For typedefs, the above must be available for all components of the

For variables

1. read, write, invoke, reference, and dereference
    (A variable may be invoked if it holds a pointer to a function.)
2. type: class, struct, typedef, or builtin
3. const, static, or automatic
4. (overrides can be derived)
5. for uses, discarded Koenig lookups

For functions

1. for each parameter and return type, cf. "for variables", above
2. invoke or reference
3. (overrides can be derived)
4. for invocations, discarded Koenig lookups

For operators

1. declare, define, reference, or invoke
2. friendship (1 : many)
3. for invocations, discarded Koenig lookups

For the preprocessor

1. define or use
2. scope
3. post-processing interpretation, as above


As I said, I would like to know if the above information is accessible
from the clang "kit" and what, if anything, has been undertaken in this
vicinity heretofore. If clang can provide the information, the project
I have in mind -- of writing a tool to collect it and keep it in a
database -- is both useful and feasible.

It's a big question, I know. You can appreciate I'd want to know the
feasibility first, before diving in.

Thank you for your time.



P.S. Prior to posting, I tried to read the mailing list archives. I
must not be the first to notice they're almost impossible to read
because the text doesn't wrap in the browser.

clang should support much of what you ask for.

DXR ( ) is an existing attempt to use
clang to build a program database. is
some old hack from me that does the same in worse - but since there's
a lot less code, maybe it's easier for a first look (relevant file:


This is right up your alley:

Nico Weber <> writes:

clang should support much of what you ask for.

DXR ( ) is an existing attempt to use
clang to build a program database. is
some old hack from me that does the same in worse - but since there's
a lot less code, maybe it's easier for a first look (relevant file:

dxr-index.cpp is ~600 lines, so it's not enormous. It seems non-trivial
to get everything set up. Easy enough to build, though.

(Interestingly, it's a clang plugin rather than a standalone tool using
libclang. (As is include-what-you-use.)

I wonder if that's a trend: that while it seems attractive to build a
standalone against a library with a reasonably stable and documented C
API, in practice it's easier to build a plugin because that lets you use
a richer set of classes and (likely more interesting?) makes it easier
to plug in to existing build systems, so it'll be easier for people to
use in their own projects.)


Currently, running as a plugin is the easiest way to run a tool over code.
I have a patch out under review which makes it simpler to write a
standalone tool; whether you want a standalone tool or a plugin
depends on the use cases you have - the nice thing about standalone
tools is that they're orthogonal to the build system, and you can
selectively run them over files that normally would not be rebuilt.
For CMake there's already support to output the compile commands in a
parseable form so that standalone tools can run over arbitrary files.
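
As a small sketch of how a standalone tool might consume those compile
commands: the code below assumes the JSON compilation database layout
(a compile_commands.json file, which CMake writes when
CMAKE_EXPORT_COMPILE_COMMANDS is enabled), where each entry carries a
"directory", a "file", and either a "command" string or an "arguments"
list. The sample entries and the load_commands helper are hypothetical.

```python
import json
import shlex

# Invented sample in the compile_commands.json layout.
sample = """
[
  {"directory": "/build",
   "command": "clang++ -Iinclude -c ../src/a.cpp -o a.o",
   "file": "../src/a.cpp"},
  {"directory": "/build",
   "arguments": ["clang++", "-DNDEBUG", "-c", "../src/b.cpp"],
   "file": "../src/b.cpp"}
]
"""

def load_commands(text):
    """Map each source file to (working directory, argument list)."""
    result = {}
    for entry in json.loads(text):
        # An entry may give a pre-split "arguments" list or a shell "command".
        args = entry.get("arguments") or shlex.split(entry["command"])
        result[entry["file"]] = (entry["directory"], args)
    return result

commands = load_commands(sample)
for path, (directory, args) in sorted(commands.items()):
    print(path, directory, args)
```

With such a map, a tool can pick any single file and parse it with the
same flags the build would have used, without invoking the build system.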

for more info.