Breakpoint + callback performance ... Can it be faster?

I recently started using lldb to write a basic instrumentation tool for tracking the values of variables at various points in a program. I’ve been working with lldb for less than two weeks, so I’m pretty new to it, though I have used and written llvm passes in the past and am familiar with the clang/llvm/lldb ecosystem.

I have a very early prototype of the tool up and running, using the C++ API. The user can specify either an executable to run or an already-running PID to attach to. The user also supplies a file+line_number at which a breakpoint (with a callback) is placed. For testing/prototyping purposes, the breakpoint callback just increments a counter and then immediately returns false. Eventually, more interesting things will happen in this callback.
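
Roughly, the setup looks like this (a simplified sketch rather than my exact code; exe_path and pid stand in for the user-supplied values):

lldb::SBDebugger::Initialize();
lldb::SBDebugger debugger = lldb::SBDebugger::Create();
lldb::SBTarget target = debugger.CreateTarget(exe_path);

// Either launch the executable...
lldb::SBProcess process = target.LaunchSimple(nullptr, nullptr, ".");

// ...or attach to an already-running PID:
// lldb::SBError error;
// lldb::SBProcess process =
//     target.AttachToProcessWithID(debugger.GetListener(), pid, error);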

I’ve noticed that just the action of hitting a breakpoint and invoking the callback is very expensive. I did some instruction-count collection by running this lldb tool on a simple test program, placing the breakpoint+callback at different points in the program so that it gets triggered a different number of times in each run. I used perf stat -e instructions ... to gather instruction counts for each run. After doing a little math, it appears that I’m incurring roughly 1.0-1.1 million executed instructions per breakpoint hit.
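
(To make the arithmetic concrete, with purely illustrative numbers: the per-hit cost is roughly (instructions_with_breakpoint - instructions_without) / number_of_hits, so an extra ~1.1 billion instructions over 1,000 hits works out to ~1.1 million instructions per hit.)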

This much overhead is prohibitive for my needs, because I want to place callbacks in hot portions of the “inferior” program.

Is there a way to make this faster? Is it possible to create “lighter-weight” breakpoints? I really like the lldb API (though the documentation is lacking in some places), but if this performance hit can’t be mitigated, it may be unusable for me.

For reference, this is the callback function:

static int cb_count = 0;

bool SimpleCallback(void *baton,
                    lldb::SBProcess &process,
                    lldb::SBThread &thread,
                    lldb::SBBreakpointLocation &location) {
    // TODO: Eventually do more interesting things...
    cb_count++;
    return false;
}

And here is how I set up the breakpoint and register the callback:

lldb::SBBreakpoint bp1 =
    debugger_data->target.BreakpointCreateByLocation(file_name, line_no);
if (!bp1.IsValid())
    std::cerr << "invalid breakpoint\n";
bp1.SetCallback(SimpleCallback, 0);

-Benjamin

Are you sure the actual handling of the breakpoint & callback in lldb is what is taking most of the time? The last time we looked at this, the majority of the work was in communicating with debugserver to get the stop notification and restart the process. Note that, besides all the packet code, this involves context switches from process -> lldb-server -> lldb and back, which is also pretty expensive.

Greg just switched to using a unix-domain socket for this communication for platforms that support it. This speeds up the packet traffic side of things.

One of the original motivations of having lldb-server be based on lldb classes - as opposed to the MacOS X version of debugserver which is an independent construct - was that you could re-use the server code to create an in-process Process plugin, eliminating a lot of this traffic & context switching when you needed maximum speed. The original Mac OS X lldb port actually had a process plugin wholly in-process with lldb as well as the debugserver-based one, but there wasn't enough motivation to justify maintaining two different implementations of the same code. I don't know whether the Linux port takes advantage of this possibility, however; that would be something to look into.

Once we actually find out about the stop, figuring out the breakpoint and getting to its callback is pretty simple... I doubt that making "lighter weight breakpoints" in particular will recover the performance you need, though if your sampling turns up evidence that some inefficient algorithms have crept in, it would be great to fix that.

Another option we've toyed with on and off is something like the gdb "tracepoints", where you can upload instructions to the lldb-server instance to perform "experiments" when a breakpoint is hit. The work to perform the experiment and the results would all be kept in the lldb-server instance until a real breakpoint is hit, at which point lldb can download all the results and present them to the user. This would eliminate some of the context switches and packet traffic while you were running in the hot parts of your code. This is a decent chunk of work, however.

Jim

Thanks for the quick reply.

> Are you sure the actual handling of the breakpoint & callback in lldb is what is taking most of the time?

I'm not positive. I did collect some callgrind profiles to take a look at where most of the time is being spent, but I'm not very familiar with lldb internals, so the results were hard to interpret. I did notice that there was a lot of packet/network activity when using lldb to profile a program (which I assumed was communication between my program and lldb-server). I was not sure how this affected the performance, so perhaps this is the real bottleneck.

> Greg just switched to using a unix-domain socket for this communication for platforms that support it. This speeds up the packet traffic side of things.

In what version of lldb was this introduced? I'm running 3.7.1. I'm also on Ubuntu 14.04; is that a supported platform?

> One of the original motivations of having lldb-server be based on lldb classes - as opposed to the MacOS X version of debugserver which is an independent construct - was that you could re-use the server code to create an in-process Process plugin, eliminating a lot of this traffic & context switching when you needed maximum speed.

That sounds very interesting. Is there an example of this implementation you could point me to?

> Thanks for the quick reply.
>
>> Are you sure the actual handling of the breakpoint & callback in lldb is what is taking most of the time?
>
> I'm not positive. I did collect some callgrind profiles to take a look at where most of the time is being spent, but I'm not very familiar with lldb internals, so the results were hard to interpret. I did notice that there was a lot of packet/network activity when using lldb to profile a program (which I assumed was communication between my program and lldb-server). I was not sure how this affected the performance, so perhaps this is the real bottleneck.

I would be pretty surprised if it was not. We had some bugs in breakpoint handling - mostly related to having very, very many breakpoints. But other than that, the dispatching of the breakpoint StopInfo is a pretty simple, straightforward bit of work.

>> Greg just switched to using a unix-domain socket for this communication for platforms that support it. This speeds up the packet traffic side of things.
>
> In what version of lldb was this introduced? I'm running 3.7.1. I'm also on Ubuntu 14.04; is that a supported platform?

It is just in TOT lldb; he just added it last week. It is currently only turned on for OS X.

>> One of the original motivations of having lldb-server be based on lldb classes - as opposed to the MacOS X version of debugserver which is an independent construct - was that you could re-use the server code to create an in-process Process plugin, eliminating a lot of this traffic & context switching when you needed maximum speed.
>
> That sounds very interesting. Is there an example of this implementation you could point me to?

FreeBSD & Windows still have native Process plugins. But they aren't used for the lldb-server implementation so far as I can tell (I've mostly worked on the OS X side). I think this was more of a design intent that hasn't actually been used anywhere yet. But the Linux/Android folks will know better.

Jim

> It is just in TOT lldb; he just added it last week. It is currently only turned on for OS X.

Good to know, thanks.

> FreeBSD & Windows still have native Process plugins. But they aren't used for the lldb-server implementation so far as I can tell (I've mostly worked on the OS X side). I think this was more of a design intent that hasn't actually been used anywhere yet. But the Linux/Android folks will know better.

If any of the Linux/Android folks do know, please get in touch with me.
Thanks,

Hello Benjamin, all,

The lldb-server implementation on Linux works exactly the same way as debugserver does on OS X -- it runs out of process and uses sockets to communicate with the client. The socketpair() optimization that Jim is talking about is not enabled there yet - I want to do some benchmarks first to make sure it really helps. Feel free to try it out if you want; I'd be very interested in hearing the results (the relevant commit is r278524). However, I doubt it will make enough difference to make your use case work.

Moving the debugging code into the same process could easily make
things an order of magnitude faster and it _might_ be enough for your
purposes, but it's an extremely non-trivial task. I think doing that
would be a great idea, but I would expect a serious commitment to
maintaining and testing that path from whoever wanted to do that.

cheers,
pl