RFC: Improving the performance of ItaniumDemangle

(Again), while trying to improve the performance of lldb, I ran into a bottleneck with the demangler. This may be specific to my platform - Ubuntu 16.04, probably using libstdc++, not libc++. The demangler makes extensive use of std::string and std::vector, and I see memory allocation at the top of the profile. I prototyped a version that uses an arena-style memory allocator (you can allocate, but you can’t ever free). It is approximately 14+% faster. I think I can optimize it further by making repeated appends zero-copy (for the string being appended, too).
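For readers unfamiliar with the term, here is a minimal sketch of an arena-style ("bump") allocator of the kind described above. It is not the code from the prototype; the class name `BumpArena`, its default block size, and its interface are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical bump ("arena") allocator: memory is carved out of large blocks
// and only released when the whole arena is destroyed. There is no per-object free.
class BumpArena {
public:
  explicit BumpArena(std::size_t blockSize = 4096) : blockSize_(blockSize) {}
  BumpArena(const BumpArena &) = delete;
  BumpArena &operator=(const BumpArena &) = delete;

  // Returns `size` bytes aligned to `align`, which must be a power of two
  // no larger than alignof(std::max_align_t).
  void *allocate(std::size_t size,
                 std::size_t align = alignof(std::max_align_t)) {
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(align <= alignof(std::max_align_t));
    // Round the current position up to the requested alignment.
    std::size_t offset = (used_ + align - 1) & ~(align - 1);
    if (blocks_.empty() || offset + size > capacity_) {
      // Current block is exhausted: start a new one, sized to fit
      // oversized requests. The old block stays alive until destruction.
      capacity_ = size > blockSize_ ? size : blockSize_;
      blocks_.push_back(std::make_unique<unsigned char[]>(capacity_));
      offset = 0;
    }
    used_ = offset + size;
    return blocks_.back().get() + offset;
  }

  std::size_t blockCount() const { return blocks_.size(); }

private:
  std::size_t blockSize_;
  std::size_t capacity_ = 0; // size of the current (last) block
  std::size_t used_ = 0;     // bytes consumed in the current block
  std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};
```

The appeal for a demangler is that every intermediate string lives only as long as one demangle call, so a single bulk release at the end replaces many individual malloc/free pairs.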

The code right now is a little ugly, because it uses a thread local variable to pass around the arena pointer, rather than change every + and += to be function calls that take db.arena as a parameter. I’m not sure what you guys would prefer for that either (thread local variable vs api change).
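To make the trade-off concrete, here is a rough sketch of the two options. None of these names come from the patch: `Arena`, `ArenaScope`, `TlsString` and `ExplicitString` are hypothetical stand-ins, and the "allocation" is just a counter so the example stays short.

```cpp
#include <cassert>
#include <string>

// Stand-in for a real arena; it only counts requests.
struct Arena {
  int allocations = 0;
};

// Option A: a thread-local "current arena" that operators can reach
// without any change to their signatures.
thread_local Arena *currentArena = nullptr;

// RAII helper that installs an arena for the duration of one demangle call.
class ArenaScope {
public:
  explicit ArenaScope(Arena &arena) : previous_(currentArena) {
    currentArena = &arena;
  }
  ~ArenaScope() { currentArena = previous_; }

private:
  Arena *previous_;
};

struct TlsString {
  std::string text;
  // Call sites keep using `+=`; the arena is found implicitly.
  TlsString &operator+=(const char *piece) {
    assert(currentArena != nullptr);
    ++currentArena->allocations;
    text += piece;
    return *this;
  }
};

// Option B: the API change. Every append names the arena explicitly,
// so there is no hidden state, but every call site has to be rewritten.
struct ExplicitString {
  std::string text;
  void append(Arena &arena, const char *piece) {
    ++arena.allocations;
    text += piece;
  }
};
```

Option A keeps the diff small at the cost of hidden per-thread state; option B touches every `+` and `+=` but makes the dependency visible.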

I thought the plan of record was (r280732):

'''
Once the fast demangler in lldb can handle any names this
implementation can be replaced with it and we will have the one true
demangler.
'''

What is the status of lldb's fast demangler? Is it available on Ubuntu 16.04?

vedant

well, top-of-branch lldb uses this code, that’s how I found it. Do you mean libc++'s demangler?

FYI when I said 14+% (and now it’s 17%), I mean the overall performance of starting lldb, not just the demangler itself. It’s probably several times faster now with this change (https://reviews.llvm.org/D32500)

> well, top-of-branch lldb uses this code, that's how I found it. Do you mean libc++'s demangler?

Thanks for explaining, this is the first time I'm looking at the demangler situation. It looks like libcxxabi has an arena-based demangler, and that the one in llvm is different.

I'm confused by this because the comment in llvm says that libcxxabi is supposed to reuse the llvm demangler. This doesn't seem to be happening, right?

> FYI when I said 14+% (and now it's 17%), I mean the overall performance of starting lldb, not just the demangler itself. It's probably several times faster now with this change (https://reviews.llvm.org/D32500)

Do you know what the llvm policy is on using TLS in library code? I can't find any mention of this in the programmer's manual, and my officemates don't know either.

vedant

> well, top-of-branch lldb uses this code, that's how I found it. Do you
> mean libc++'s demangler?

Thanks for explaining, this is the first time I'm looking at the demangler
situation. It looks like libcxxabi has an arena-based demangler, and that
the one in llvm is different.

I'm confused by this because the comment in llvm says that libcxxabi is
supposed to reuse the llvm demangler. This doesn't seem to be happening,
right?

I'm confused too. I'm new here :)

> FYI when I said 14+% (and now it's 17%), I mean the overall performance
> of starting lldb, not just the demangler itself. It's probably several
> times faster now with this change (https://reviews.llvm.org/D32500)

Do you know what the llvm policy is on using TLS in library code? I can't
find any mention of this in the programmer's manual, and my officemates
don't know either.

I don't know, and frankly I don't like using it. It was more "to get the
conversation started." I can change all the string routines to take the
arena as a parameter, it'll just make the diff look larger.

But if libcxxabi has already done the heavy lifting then maybe I should
just benchmark their demangler instead.

+1, with a caveat. If it turns out that the demangler in libcxxabi is faster, but that it can't be incorporated into llvm for some (unknown to me) reason, the benchmarking may be wasted effort. It would be good to ping some people who know what's up with the demanglers.

vedant

I'm confused by this because the comment in llvm says that libcxxabi is
supposed to reuse the llvm demangler. This doesn't seem to be happening,
right?

This seems correct. libcxxabi demangler [1] is different from the one used
by llvm [2]. I'm hoping Saleem, Eric or Jon (copied) knows a bit of history
as to why this is so (perhaps because the two projects evolved
independently ?).

> FYI when I said 14+% (and now it's 17%), I mean the overall performance
> of starting lldb, not just the demangler itself. It's probably several
> times faster now with this change (https://reviews.llvm.org/D32500)

Do you know what the llvm policy is on using TLS in library code? I can't
find any mention of this in the programmer's manual, and my officemates
don't know either.

Both libcxx and libcxxabi use __libcpp_tls_*() functions of the threading
API [3] (which call pthread functions on most platforms) for thread-local
storage needs. IIRC thread_local is not implemented across all the
platforms that llvm supports.

If the idea is to improve libcxxabi's demangler, then it should be
straightforward to use these functions instead of thread_local.

[1] llvm-mirror/libcxxabi (legacy mirror; project moved to https://github.com/llvm/llvm-project): src/cxa_demangle.cpp
[2] https://github.com/llvm-mirror/llvm/blob/master/lib/Demangle/ItaniumDemangle.cpp
[3] llvm-mirror/libcxx (legacy mirror; project moved to https://github.com/llvm/llvm-project): include/__threading_support

PS: Here's a particularly amusing bug of the current libcxxabi demangler:
https://bugs.llvm.org//show_bug.cgi?id=31031

Cheers,

/ Asiri

A copy of the libcxxabi one was added to libSupport relatively recently so that lldb could use it… so they should be almost the same. lldb has its own because it has different constraints w.r.t. memory allocation and speed compared to the _cxa* one. (I don’t know much about the details there though). It falls back on the _cxa* implementation for some cases where the “fast” one’s implementation is incomplete (again, repeating what I remember… I don’t know the details).

Jon

The libcxxabi demangler doesn’t look any faster than the llvm demangler. It still suffers from excessive use of malloc.

I didn't realize lldb had its own demangler. It must not be very thorough,
because my lldb session was falling back to llvm's demangler quite a lot!

Without my change, disabling lldb's FastDemangler is ~10% slower.
With my change, disabling lldb's FastDemangler is ~1.25% slower.
(as measured by perf stat running lldb, # of cycles. Interestingly, the
instruction count difference is much larger, implying lldb's demangler has
very poor IPC).

The one in llvm required a few changes to be more portable. If it can
be made faster that is a good thing.

If possible you should make the change in libcxxabi and copy the code
to llvm given the license difference between the two.

Cheers,
Rafael

This seems correct. libcxxabi demangler [1] is different from the one used
by llvm [2]. I'm hoping Saleem, Eric or Jon (copied) knows a bit of history
as to why this is so (perhaps because the two projects evolved
independently ?).

They didn't really evolve independently; the version in LLVM was imported
from libc++. However, we simplified it to make it more portable. The
simplifications naturally led to the ability to remove the arena allocation
routines. The copy in libc++ needs to retain a certain amount of
flexibility due to the exporting of the interface into the user's address
space (via the __cxa_demangle interface). However, making adjustments that
improve performance in the LLVM version should be acceptable.

One of the things we need in lldb is to find a "basename" of a
function. This is so that you can set a breakpoint on a function named
"foo", even if "foo" is a method of a template class declared inside
another function, which is inside 5 namespaces, a couple of them being
anonymous... :)

Currently we do that by parsing the demangled name, which is neither
fast nor easy, especially when function pointers come into the game
(D31451). We could probably save some runtime if the demangler could
provide us with a bit more structured information instead of just the
final demangled string. I expect the demangler to be in a much better
position to do that than us trying to reverse engineer the string.
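To give a feel for why recovering the basename from the demangled string is awkward, here is a deliberately naive sketch. It is not lldb's actual parser; `naiveBasename` is a hypothetical helper that only balances parentheses and angle brackets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive basename extraction from an already-demangled name.
// Assumes the last ")" closes the parameter list and that "<"/">" only
// ever delimit template arguments. Both assumptions are easy to violate.
std::string naiveBasename(const std::string &demangled) {
  // Step 1: find the "(" matching the last ")"; the name ends there.
  std::size_t end = demangled.size();
  std::size_t close = demangled.rfind(')');
  if (close != std::string::npos) {
    int depth = 0;
    for (std::size_t i = close + 1; i-- > 0;) {
      if (demangled[i] == ')') {
        ++depth;
      } else if (demangled[i] == '(' && --depth == 0) {
        end = i;
        break;
      }
    }
  }
  // Step 2: walk backwards over the name, skipping template argument
  // lists, until a scope separator or a space outside any "<...>".
  std::size_t begin = end;
  int angle = 0;
  while (begin > 0) {
    char c = demangled[begin - 1];
    if (c == '>')
      ++angle;
    else if (c == '<')
      --angle;
    else if (angle == 0 && (c == ':' || c == ' '))
      break;
    --begin;
  }
  // Step 3: drop the function's own template arguments, if any.
  std::string name = demangled.substr(begin, end - begin);
  return name.substr(0, name.find('<'));
}
```

This works for inputs such as "foo::bar(int)" or "(anonymous namespace)::Foo<int>::baz(char) const", but it gives wrong answers for functions returning function pointers, for operator< and operator(), and for similar cases, which is the kind of breakage a structured result from the demangler would avoid.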

I have no idea how this fits in with the rest of the goals of llvm's
demangler, but I guess it doesn't hurt throwing the idea out there.

cheers,
pl

[+lldb-dev]