Clang indexing library performance

I was playing with libclang-based indexing and clang_complete plugin for Vim.
Having found completion responsiveness a bit lacking, I decided to track down whether it's the plugin's fault.

To put the indexer under stress, I've made up an artificial test with the biggest and baddest C++ headers:

---------------8<---------------
#include <boost/asio.hpp>
#include <boost/regex.hpp>
#include <boost/thread.hpp>
#include <boost/spirit.hpp>
#include <boost/signals2.hpp>
boost::
---------------8<---------------

...and the following minimal test program using Python bindings:

---------------8<---------------
from clang import cindex

# Timing decorator: prints the function's name and wall-clock duration.
def clock(f):
    import time
    def wrapf(*args,**kw):
        start = time.time()
        f(*args,**kw)
        end = time.time()
        print f.func_name, end - start
    return wrapf

fname='...path to file...'
idx=cindex.Index.create()

@clock
def parse():
    global tu
    opts = ['-I/opt/local/include'] # MacPorts Boost path
    tu=idx.parse(fname,opts)

@clock
def complete():
    c = tu.codeComplete(fname,6,8)

parse()
for i in range(4):
    complete()
---------------8<---------------

This is the timing I get on a MacBook Pro:

parse 3.96061992645
complete 3.31106615067
complete 3.17438578606
complete 3.37997102737
complete 3.16793084145

Each header individually isn't *too* bad, taking under 1s, though still not too responsive, with the exception of signals2.hpp, which takes over 1.5s, and the less exotic thread.hpp, which takes about 1.2s (per-header times measured by parsing each header in isolation, roughly as sketched below).
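A minimal sketch of how such per-header timings can be taken with the same bindings (the file name one.cc is a placeholder of mine, and feeding the source as an unsaved file assumes parse() accepts unsaved_files, as later versions of the bindings do):

---------------8<---------------
# Time each header in isolation by parsing a one-line file that
# includes it. Python 2 style, matching the test program below.
import time
from clang import cindex

headers = ['boost/asio.hpp', 'boost/regex.hpp', 'boost/thread.hpp',
           'boost/spirit.hpp', 'boost/signals2.hpp']
idx = cindex.Index.create()
for h in headers:
    src = [('one.cc', '#include <%s>\n' % h)]   # in-memory source file
    t0 = time.time()
    idx.parse('one.cc', ['-I/opt/local/include'], src)
    print h, time.time() - t0
---------------8<---------------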

So, the questions:

Am I misusing the library? Should I try passing some additional flags or anything else? Or is it just the way it is? What puzzles me is that completion takes time comparable to parsing. I take it headers have to be re-parsed in case the order of inclusion changes, as that may potentially change header contents, but 3s still seems a bit excessive.

I'd appreciate any clues.

> Am I misusing the library? Should I try passing some additional flags or anything else?

You want to use the "default editing options" when parsing the translation unit

    clang_defaultEditingTranslationUnitOptions()

and then reparse at least once. That will enable the various code-completion optimizations that should bring this time down significantly.
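In the Python bindings of this era, that suggestion comes out roughly as the sketch below. The flag values are the CXTranslationUnit_* bits quoted later in this thread; the reparse() wrapper is an assumption (if your bindings lack it, clang_reparseTranslationUnit has to be reached via ctypes):

---------------8<---------------
# Parse with "editing" options, then reparse once so the preamble PCH
# and the cached global completion results get built.
from clang import cindex

PrecompiledPreamble    = 0x04   # CXTranslationUnit_PrecompiledPreamble
CacheCompletionResults = 0x08   # CXTranslationUnit_CacheCompletionResults
CXXPrecompiledPreamble = 0x10   # CXTranslationUnit_CXXPrecompiledPreamble
editing = (PrecompiledPreamble | CXXPrecompiledPreamble |
           CacheCompletionResults)

fname = '...path to file...'
idx = cindex.Index.create()
tu = idx.parse(fname, ['-I/opt/local/include'], [], editing)
tu.reparse([])   # assumed wrapper around clang_reparseTranslationUnit
---------------8<---------------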

> Or is it just the way it is? What puzzles me is that completion takes time comparable to parsing.

That's because it *is* parsing :-)

> I take it headers have to be re-parsed in case the order of inclusion changes, as that may potentially change header contents, but 3s still seems a bit excessive.

Given the volume of code you're parsing, I wouldn't call 3s excessive. On my system, GCC takes 2.7s, Clang normally takes 2.3s, and Clang when parsing for code completion takes 1.6s (it takes some shortcuts).

If you then turn on the optimizations I mentioned above, it goes down to 0.4s. Of course, it could always be better.

  - Doug

> ...and the following minimal test program using Python bindings:

---------------8<---------------
> from clang import cindex

Define these flags yourself, as they are not yet available in the official Python bindings:

        PrecompiledPreamble    = 0x04
        CacheCompletionResults = 0x08
        CXXPrecompiledPreamble = 0x10
        flags = (PrecompiledPreamble |
                 CXXPrecompiledPreamble |
                 CacheCompletionResults)

> def clock(f):
>     import time
>     def wrapf(*args,**kw):
>         start = time.time()
>         f(*args,**kw)
>         end = time.time()
>         print f.func_name, end - start
>     return wrapf
>
> fname='...path to file...'
> idx=cindex.Index.create()
>
> @clock
> def parse():
>     global tu
>     opts = ['-I/opt/local/include'] # MacPorts Boost path
>     tu=idx.parse(fname,opts)

...and pass them when parsing:

        tu=idx.parse(fname,opts,[],flags)

This should give you a lot better results.

> @clock
> def complete():
>     c = tu.codeComplete(fname,6,8)

Maybe passing the flags here helps even further:

        c = tu.codeComplete(fname,6,8,flags)

> parse()
> for i in range(4):
>     complete()
---------------8<---------------

> This is the timing I get on a MacBook Pro:
>
> parse 3.96061992645
> complete 3.31106615067
> complete 3.17438578606
> complete 3.37997102737
> complete 3.16793084145

Can you send me your new results, first with only the parse() change and then with both the parse() and complete() changes?

Thanks and cheers
Tobi

And regarding clang_complete itself:

There are definitely some remaining performance problems in clang_complete itself. Try the attached two patches and check if you see some improvements.

You can get timing information in the status row by setting 'g:clang_debug' to 1.

Can you report the timing for clang_complete before and after these patches? It should improve significantly.

Furthermore, if there is a large number of suggested completions, clang_complete may take significant time to format them. This could definitely use some improvement too.

Cheers
Tobi

0001-libclang-Enable-TranslationUnit.CacheCompletionResul.patch (1.03 KB)

0002-libclang-Do-not-reparse-the-file-right-after-parsing.patch (1.17 KB)

It improves things a little, but not by much.

Before:

LibClang - First parse: 3.21348905563
LibClang - First reparse (generate PCH cache): 6.03147506714
LibClang - Code completion time: 2.16154503822
clang_complete: completion time (library) 11.986595
LibClang - Code completion time: 2.20204496384
clang_complete: completion time (library) 2.777556
LibClang - Code completion time: 2.21751308441
clang_complete: completion time (library) 2.794342

After:

LibClang - First parse: 3.13683891296
LibClang - Code completion time: 2.13823485374
clang_complete: completion time (library) 5.856766
LibClang - Code completion time: 2.18497300148
clang_complete: completion time (library) 2.758560
LibClang - Code completion time: 2.19934606552
clang_complete: completion time (library) 2.775554

I have set the appropriate flags and tried re-parsing, but I get nowhere near the 0.4s completion time (which would be fantastic, of course).
To be completely clear, I'm using the official 2.9 release.

parse only:

parse 3.55141305923
complete 2.60280609131
complete 2.53132414818
complete 2.54255604744
complete 2.50854682922

parse + complete:

parse 3.58968901634
complete 2.62985301018
complete 2.55927109718
complete 2.54651212692
complete 2.55211901665

I have also tried to eliminate the middlemen completely and use the C interface directly, using this tool: https://gist.github.com/758615 from this clang_complete bug report: https://github.com/Rip-Rip/clang_complete/issues/17 which, as I can see, you have already read.

I still get about 2.5s completion time even when setting parse and completion flags to clang_defaultEditingTranslationUnitOptions() and clang_defaultCodeCompleteOptions() respectively.
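For reference, those two defaults can also be pulled out of libclang from Python with ctypes when the bindings don't wrap them; a hedged sketch, with the library name being an assumption to adjust for your install:

---------------8<---------------
# Ask libclang itself for the default parse/completion option bits.
import ctypes

lib = ctypes.CDLL('libclang.dylib')   # e.g. 'libclang.so' on Linux
lib.clang_defaultEditingTranslationUnitOptions.restype = ctypes.c_uint
lib.clang_defaultCodeCompleteOptions.restype = ctypes.c_uint

parse_flags = lib.clang_defaultEditingTranslationUnitOptions()
complete_flags = lib.clang_defaultCodeCompleteOptions()
print parse_flags, complete_flags
---------------8<---------------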

> Can you send me your new results, first with only the parse() change and then with both the parse() and complete() changes?
>
> Thanks and cheers
> Tobi

> parse only:
>
> parse 3.55141305923
> complete 2.60280609131
> complete 2.53132414818
> complete 2.54255604744
> complete 2.50854682922
>
> parse + complete:
>
> parse 3.58968901634
> complete 2.62985301018
> complete 2.55927109718
> complete 2.54651212692
> complete 2.55211901665

OK. This is not very convincing.

> I have also tried to eliminate the middlemen completely and use the C interface directly, using this tool: https://gist.github.com/758615 from this clang_complete bug report: https://github.com/Rip-Rip/clang_complete/issues/17 which, as I can see, you have already read.

OK. Let's get a common baseline. Can you run the complete tool (from the bug report) on the attached boost.cc file (taken from the very same bug report; you need Boost installed) and compare it with my results?

boost.cc (353 Bytes)

I may have omitted one crucial detail: I run the 2.9 release.
I've run your test case with 2.9 and an svn build, and I do indeed get timing comparable to yours.

-------------------8<-------------------
Parsing boost.cc: 1.3869 (100.0%) 0.2169 (100.0%) 1.6038 (100.0%) 1.6073 (100.0%)
Precompiling preamble: 2.1251 (100.0%) 0.4021 (100.0%) 2.5273 (100.0%) 2.7308 (100.0%)
Cache global code completions for boost.cc: 0.4364 (100.0%) 0.0050 (100.0%) 0.4413 (100.0%) 0.4414 (100.0%)
Reparsing boost.cc: 2.6944 (100.0%) 0.5080 (100.0%) 3.2024 (100.0%) 3.4065 (100.0%)
Code completion @ boost.cc:16:10: 0.1284 (100.0%) 0.0418 (100.0%) 0.1702 (100.0%) 0.1703 (100.0%)
-------------------8<-------------------
Parsing boost.cc: 1.4944 (100.0%) 0.1213 (100.0%) 1.6157 (100.0%) 1.6160 (100.0%)
Precompiling preamble: 2.0441 (100.0%) 0.2648 (100.0%) 2.3089 (100.0%) 2.3872 (100.0%)
Cache global code completions for boost.cc: 0.5091 (100.0%) 0.0047 (100.0%) 0.5138 (100.0%) 0.5138 (100.0%)
Reparsing boost.cc: 2.6945 (100.0%) 0.3137 (100.0%) 3.0082 (100.0%) 3.0866 (100.0%)
Code completion @ boost.cc:16:10: 0.1372 (100.0%) 0.0275 (100.0%) 0.1647 (100.0%) 0.1651 (100.0%)
-------------------8<-------------------

However... for my original test case, I get almost 4x speedup!

-------------------8<-------------------
Parsing f.cc: 2.9795 (100.0%) 0.5403 (100.0%) 3.5199 (100.0%) 3.5210 (100.0%)
Precompiling preamble: 4.6721 (100.0%) 1.0640 (100.0%) 5.7361 (100.0%) 6.4709 (100.0%)
Cache global code completions for f.cc: 1.0227 (100.0%) 0.0065 (100.0%) 1.0292 (100.0%) 1.0324 (100.0%)
Reparsing f.cc: 6.2431 (100.0%) 1.3625 (100.0%) 7.6056 (100.0%) 8.3611 (100.0%)
Code completion @ f.cc:6:8: 2.0723 (100.0%) 0.4668 (100.0%) 2.5391 (100.0%) 2.5500 (100.0%)
-------------------8<-------------------
Parsing f.cc: 3.2017 (100.0%) 0.2806 (100.0%) 3.4822 (100.0%) 3.4856 (100.0%)
Precompiling preamble: 4.2331 (100.0%) 0.5612 (100.0%) 4.7942 (100.0%) 5.0883 (100.0%)
Cache global code completions for f.cc: 1.2300 (100.0%) 0.0061 (100.0%) 1.2362 (100.0%) 1.2362 (100.0%)
Reparsing f.cc: 6.0510 (100.0%) 0.7061 (100.0%) 6.7571 (100.0%) 7.0513 (100.0%)
Code completion @ f.cc:6:8: 0.6013 (100.0%) 0.0895 (100.0%) 0.6908 (100.0%) 0.6909 (100.0%)
-------------------8<-------------------

So, I guess I'll just have to use the svn build until the next Clang release.
There is still overhead in clang_complete itself, but that's not clang's problem.

Thanks.

> I have set the appropriate flags and tried re-parsing, but I get nowhere near the 0.4s completion time (which would be fantastic, of course).
> To be completely clear, I'm using the official 2.9 release.

Ah. Some of these optimizations were not available for C++ in that release.

> I may have omitted one crucial detail: I run the 2.9 release.
> I've run your test case with 2.9 and an svn build, and I do indeed get timing comparable to yours.

Great. You seem to have a machine that is even faster than mine.

> However... for my original test case, I get almost 4x speedup!

> So, I guess I'll just have to use the svn build until the next Clang release.

Still quite a long time, but a lot faster than before. Do you think this works for you, or do you plan to look into further possibilities for speedup?

> There is still overhead in clang_complete itself, but that's not clang's problem.

Yes, as soon as the number of completion results gets large, clang_complete becomes notably slower. If you plan to look into this, I am glad to help you here.

Cheers
Tobi

2011/10/3 Tobias Grosser <tobias@grosser.es>:

Thanks, I think there is no low-hanging fruit left on Clang's side. If you know of any, feel free to share. :-)

I'll probably be writing my own completer in C++ in the medium term to replace clang_complete's Python implementation (which now takes 50% of the time when completing things like "boost::"). I already tried optimizing the Python version, but there is only so much you can do.

> So did you find out where the bottleneck in clang_complete's Python implementation is?

I could get the Python implementation from ~600ms of overhead down to ~160ms with some dirty code (patch attached). CPython does very few optimizations (if any at all), so the redundant iterations over the list of chunks in a completion string create a lot of temporary objects and make a lot of calls via ctypes. I now unroll all the chunk information from completion strings before doing any transformations.
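The real change is in the attached patch; as a hedged illustration of the idea only (the helper name is mine, and attribute access loosely follows the cindex API):

---------------8<---------------
# Cross the ctypes boundary once per completion string, caching chunk
# data as plain tuples, so that later sorting/filtering/formatting only
# touches cheap native Python objects.
def unroll_completions(results):
    unrolled = []
    for r in results.results:
        chunks = [(c.kind, c.spelling) for c in r.string]  # one pass
        unrolled.append(chunks)
    return unrolled
---------------8<---------------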

> My limited investigation has shown that it's just slow with a large number of results. Limiting the results to about 10 gives almost zero overhead.

Yes, it works well on small datasets, but I work with boost and use Supertab, so I often happen to trigger autocompletion, and it often happens to be within the boost namespace, which contains half a universe. One way to avoid it would be to stop using Supertab and use C-x C-o, but that's a workaround, not a solution.

> To me it seems the sorting and filtering is very inefficient.
> I did not yet look into how to improve it, but I believe there should be some possibilities, e.g. by filtering unneeded results early.

I thought about filtering early, but it doesn't seem to work: Clang completion is triggered only once, and the rest of the filtering is handled by Vim itself as you type. So if you don't have the full list and happen to need boost::weak_ptr, which, being at the bottom of a list of 10,000 items, was filtered out, you'd have to re-trigger Clang completion on every keystroke.

I also tried to make a lazy dict for each result object from a completion string, but that wouldn't work, as clang_complete has to realize the entire list anyway when marshalling the results into VimScript land.

> As you seem to plan your own implementation: what would you do differently to overcome the performance problems you found in the Python implementation?

> Cheers
> Tobi

Eliminating Python for filtering and formatting would be a good step, I think. :-) One could try to optimize it further, but to me it seems like an exercise in frustration. I haven't yet investigated how much of the time is spent in VimScript, but I think it can be done in better than 150ms. Another thing would be to start indexing in the background; this first parse/reparse step is painfully long. An indexing daemon process with some IPC layer, perhaps?
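A rough, hedged sketch of the background-indexing idea (hypothetical names, not clang_complete's actual code; again assuming a reparse() wrapper exists):

---------------8<---------------
# Do the expensive first parse/reparse on a worker thread at editor
# startup; a completion request blocks only if it arrives before the
# warm-up has finished.
import threading
from clang import cindex

class BackgroundIndexer(object):
    def __init__(self, fname, args, flags):
        self.tu = None
        self.ready = threading.Event()
        t = threading.Thread(target=self._warm_up,
                             args=(fname, args, flags))
        t.daemon = True
        t.start()

    def _warm_up(self, fname, args, flags):
        self.idx = cindex.Index.create()
        tu = self.idx.parse(fname, args, [], flags)
        tu.reparse([])          # builds preamble PCH + completion cache
        self.tu = tu
        self.ready.set()

    def complete(self, fname, line, col):
        self.ready.wait()       # only blocks during warm-up
        return self.tu.codeComplete(fname, line, col)
---------------8<---------------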

Regards,
Alex.

PS

Forgot to CC the mailing list the first time around.

clang_complete_hack.patch (3.94 KB)