parsing without building an AST tree

Hi

We're writing a clang-based tagger and while trying to improve the
performance of our solution we came upon this paragraph:

"Elsa is not built as a stack of reusable libraries like clang is. It
is very difficult to use part of Elsa without the whole front-end. For
example, you cannot use Elsa to parse C/ObjC code without building an
AST. You can do this in Clang and it is much faster than building an
AST."

from here: http://clang.llvm.org/comparison.html

We've been using the C-api in clang-c/Index.h but if we could get
better performance by using the C++ APIs directly we'd gladly do so
(even if it might change or be harder to use).

Is there an example or some documentation on how to do this somewhere possibly?

regards

Anders

Hi

We're writing a clang-based tagger and while trying to improve the
performance of our solution we came upon this paragraph:

Not sure what your requirements for a "tagger" are; I'd be curious.

"Elsa is not built as a stack of reusable libraries like clang is. It
is very difficult to use part of Elsa without the whole front-end. For
example, you cannot use Elsa to parse C/ObjC code without building an
AST. You can do this in Clang and it is much faster than building an
AST."

from here: http://clang.llvm.org/comparison.html

We've been using the C-api in clang-c/Index.h but if we could get
better performance by using the C++ APIs directly we'd gladly do so
(even if it might change or be harder to use).

Is there an example or some documentation on how to do this somewhere possibly?

You can use the clang preprocessor to tokenize if that's all you need.
Currently there aren't really good docs around that, and I don't think
I have a really good example. I can get you some more ideas on how to
go about this if you say that preprocessor-only is what you need.

Cheers,
/Manuel

Hi Manuel

Well. We essentially provide a client/server setup where an editor can
pass a location (file,offset) to the server and some options and the
server can respond with various information. Most importantly
references to this location (from all the files we've indexed) and
whatever it refers to. This is to be able to do 21st century things
like "follow symbol" and "find references" in Emacs since I'll never
switch to an IDE. We need to be able to visit cursors and ask them
what they reference, I guess. Not sure if this would be possible with
the preprocess-only option. Likely not. If you could point me
at an example on how to do the preprocessing only I'd love to have a
look.

If you want to take a look at the project it can be found here:

https://github.com/Andersbakken/rtags

thanks

I think for your use case you really need the fully type-resolved AST.
This also means that there is no faster way to do it than to parse the
C++ code. The way you can save time is by doing aggressive in-memory
caching of processed parts of the file, which is one thing Chandler is
planning to work on (we call that "clangd" for Clang daemon).
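To make the caching idea concrete, here is a minimal sketch, not how clangd works or will work. It assumes a hypothetical `ParseCache` class that keeps one result per file and only re-runs an expensive parse step when the file contents change. The names `ParseCache`, `Parser` and `get` are invented for illustration, and caching at whole-file granularity is a simplification of "processed parts of the file".

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical in-memory cache: one parse result per file path, reused
// as long as the hash of the file contents is unchanged.
class ParseCache {
public:
    // Stand-in for the expensive step (e.g. parsing a translation unit).
    using Parser = std::function<std::string(const std::string &contents)>;

    explicit ParseCache(Parser parser) : parser_(std::move(parser)) {}

    // Returns the cached result for `path`, re-parsing only when the
    // contents hash differs from the one stored for that path.
    // Note: a hash collision would return a stale result; a real
    // implementation would need a stronger validity check.
    const std::string &get(const std::string &path, const std::string &contents) {
        const std::size_t hash = std::hash<std::string>{}(contents);
        auto it = entries_.find(path);
        if (it != entries_.end() && it->second.hash == hash)
            return it->second.result;
        Entry &entry = entries_[path];
        entry.hash = hash;
        entry.result = parser_(contents);
        return entry.result;
    }

private:
    struct Entry {
        std::size_t hash = 0;
        std::string result;
    };

    Parser parser_;
    std::unordered_map<std::string, Entry> entries_;
};
```

A server process would keep one such cache alive between editor requests, so repeated queries against an unchanged file skip the parse entirely.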

You can take a look at:
http://clang.llvm.org/docs/Tooling.html
to see the various possibilities you currently have to integrate with
clang here.

Cheers,
/Manuel

Have you seen this? http://lists.cs.uiuc.edu/pipermail/cfe-dev/2012-June/022028.html

Hi Manuel

Thanks for the info. We didn't know about libtooling. It seems like
there will be a lot of overlap between what we do and what clangd will
do. Our approach has been not to persist translation units but rather
tear out the information we need when we parse it and reparse when we
need to. The main reasons we've found for this:

1) clang_reparse doesn't seem much (any) faster than parsing the whole
thing over again.
2) clang apis do not seem to give us a way to find references for a
given cursor across translation units.

We'll be watching the project when code starts appearing, though.
I imagine APIs will pop up in Index.h as they are needed by
clangd.

regards

Anders

Hi Manuel

Thanks for the info. We didn't know about libtooling. It seems like
there will be a lot of overlap between what we do and what clangd will
do. Our approach has been not to persist translation units but rather
tear out the information we need when we parse it and reparse when we
need to. The main reasons we've found for this:

1) clang_reparse doesn't seem much (any) faster than parsing the whole
thing over again.

As far as I understand it reparsing only gets faster if you store a
precompiled preamble of the source files in between runs.

2) clang apis do not seem to give us a way to find references for a
given cursor across translation units.

USRs are made for that; I assume you've seen:
http://clang.llvm.org/doxygen/group__CINDEX__CURSOR__XREF.html

We'll be watching the project when code starts appearing, though.
I imagine APIs will pop up in Index.h as they are needed by
clangd.

I don't think python APIs will appear first - the clangd project has
C++ clients as a first goal. (Python clients are a core goal, too, but
not as high prio I think).

Cheers,
/Manuel

Hi Manuel

Thanks for the info. We didn't know about libtooling. It seems like
there will be a lot of overlap between what we do and what clangd will
do. Our approach has been not to persist translation units but rather
tear out the information we need when we parse it and reparse when we
need to. The main reasons we've found for this:

1) clang_reparse doesn't seem much (any) faster than parsing the whole
thing over again.

As far as I understand it reparsing only gets faster if you store a
precompiled preamble of the source files in between runs.

We did pass those flags to the initial clang_parseTranslationUnit call I believe. I could take another look.

I don't have first-hand knowledge here, but I've heard from multiple
people who tried this that it isn't easy.

2) clang apis do not seem to give us a way to find references for a
given cursor across translation units.

USRs are made for that; I assume you've seen:
http://clang.llvm.org/doxygen/group__CINDEX__CURSOR__XREF.html

I've seen those APIs. We use them for certain things but I am not sure how that would help for this. For example, suppose I want all calls to printf across all my source files.

The idea would be that you point at the printf you want in one of your
TUs, and create the USR from that - then you can look up the USR in
your database. Obviously you might want to store the qualified name,
too, so users can do more inclusive textual queries without needing to
go through an existing TU. But for the queries where you already are
in a source file, the additional precision of the USR would seem
important to me.
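As a rough illustration of the USR-keyed database described above, here is a minimal sketch using only the standard library. The `ReferenceIndex` class, its method names and the `Location` struct are hypothetical. In a real indexer the `usr` strings would come from `clang_getCursorUSR` on the referenced cursor while visiting each translation unit, and the table would live in persistent storage rather than in memory.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// One place in the sources where a symbol is referenced.
struct Location {
    std::string file;
    unsigned offset;
};

// Hypothetical cross-TU index: USR string -> all locations referring to it.
class ReferenceIndex {
public:
    // Record that (file, offset) refers to the symbol identified by `usr`.
    void addReference(const std::string &usr, const std::string &file,
                      unsigned offset) {
        refsByUsr_[usr].push_back(Location{file, offset});
        usrByLocation_[key(file, offset)] = usr;
    }

    // All recorded references to a USR, in insertion order.
    std::vector<Location> referencesTo(const std::string &usr) const {
        auto it = refsByUsr_.find(usr);
        if (it == refsByUsr_.end())
            return {};
        return it->second;
    }

    // "Find references" from an editor position: resolve the position to
    // a USR first, then list every reference to that USR across all files.
    std::vector<Location> referencesAt(const std::string &file,
                                       unsigned offset) const {
        auto it = usrByLocation_.find(key(file, offset));
        if (it == usrByLocation_.end())
            return {};
        return referencesTo(it->second);
    }

private:
    static std::string key(const std::string &file, unsigned offset) {
        return file + ':' + std::to_string(offset);
    }

    std::unordered_map<std::string, std::vector<Location>> refsByUsr_;
    std::unordered_map<std::string, std::string> usrByLocation_;
};
```

With this shape, "all calls to printf" becomes a single lookup by the USR obtained from any one printf reference. A second table keyed by qualified name could serve the more inclusive textual queries.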

Cheers,
/Manuel