Inquiry about performance monitors

Hello,
I want to implement support for reading performance measurement information using the perf_event_open system call. The motive is to add support for the Intel PT hardware feature, which is available through the perf_event interface. I was thinking of implementing a new wrapper like PtraceWrapper in the NativeProcessLinux files. My question is: is this the correct place to start? If not, could someone suggest another place to begin?

BR,
A Ravi Theja

[ Moving this discussion back to the list. I pressed the wrong button
when replying.]

Thanks for the explanation, Ravi. It sounds like a very useful feature indeed. I've found a reference to the debugserver profile data in GDBRemoteCommunicationClient.cpp:1276, so maybe that will help with your investigation. Maybe someone more knowledgeable can also explain what those 'A' packets are used for.

There are two different kinds of performance counters: OS performance counters and CPU performance counters. It sounds like you're talking about the latter, but it's worth considering whether this could be designed in a way that supports both (i.e. even if you don't do both yourself, at least make the machinery reusable so that it applies to both, for when someone else wants to come along and add OS perf counters).

There is also the question of this third party library. Do we take a hard dependency on libipt (probably a non-starter), or only use it if it’s available (much better)?

As Pavel said, how are you planning to present the information to the user? Through some sort of top level command like “perfcount instructions_retired”?

IMHO the best way to provide this information is to implement reverse-debugging packets in a GDB server (lldb-server). Enabling this feature via some packet to lldb-server would start gathering data, keeping the last N instructions run by all threads in some buffer that gets overwritten. The lldb-server enables it and hands a buffer to the perf_event interface. Then clients can ask lldb-server to step back in any thread. Only when the data is requested do we actually use it to implement the reverse stepping.
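For concreteness, here is a minimal sketch (assuming Linux, a recent kernel, and the mmap layout documented in perf_event_open(2)) of how the server side might map such an overwritable trace buffer; the function name and sizes are illustrative only:

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <unistd.h>

// fd is a descriptor returned by perf_event_open() for an Intel PT event.
// Returns the AUX area that receives the raw trace, or NULL on failure.
static void *map_pt_buffers(int fd, struct perf_event_mmap_page **header_out) {
  long page = sysconf(_SC_PAGESIZE);

  // Base mapping: 1 metadata page plus 2^n data pages (n = 1 here).
  struct perf_event_mmap_page *header = (struct perf_event_mmap_page *)
      mmap(NULL, page * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (header == MAP_FAILED)
    return NULL;

  // Tell the kernel where the AUX (trace) area should live, then map it.
  header->aux_offset = header->data_offset + header->data_size;
  header->aux_size = 64 * page; // illustrative size

  // Mapping the AUX area read-only selects the overwritable "snapshot"
  // mode: the kernel keeps overwriting old data, matching the
  // circular-buffer behaviour described above.
  void *aux = mmap(NULL, header->aux_size, PROT_READ, MAP_SHARED, fd,
                   header->aux_offset);
  if (aux == MAP_FAILED)
    return NULL;
  *header_out = header;
  return aux;
}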

Another way to do this would be to use a python based command that can be added to any target that supports this. The plug-in could install a set of LLDB commands. To see how to create new lldb command line commands in python, see the section named "CREATE A NEW LLDB COMMAND USING A PYTHON FUNCTION" on the http://lldb.llvm.org/python-reference.html web page.

Then you can have some commands like:

intel-pt-start
intel-pt-dump
intel-pt-stop

Each command could have options and arguments as desired. The "intel-pt-start" command could enable the feature in the target by running an expression that makes the perf_event_interface calls to allocate some memory and hand it to the Intel PT machinery. The "intel-pt-dump" command could just give a raw dump of all history for one or more threads (again, add options and arguments as needed). The python code could bridge to C and use the Intel libraries that know how to process the data.
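To make that concrete, here is a minimal sketch of what such an enabling expression could boil down to. The Intel PT PMU type is not a fixed constant and has to be read from sysfs; error handling is trimmed, so treat this as illustrative only:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int intel_pt_open(pid_t pid) {
  // The Intel PT PMU registers with a dynamic type number.
  FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");
  if (!f)
    return -1;
  int type = -1;
  if (fscanf(f, "%d", &type) != 1) {
    fclose(f);
    return -1;
  }
  fclose(f);

  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = type;        // dynamic PMU type for Intel PT
  attr.exclude_kernel = 1; // trace user space only

  // pid to trace, any cpu, no group, no flags.
  return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}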

If this all goes well we can think about building it into LLDB as a built in command.

One main benefit of doing this externally is that it allows this to be done remotely over any debugger connection. If you can run expressions to enable/disable/set up the memory buffer and access the buffer contents, then you don't need to add code to the debugger to actually do this.

Greg

Hello,
Regarding the questions in this thread, please find the answers below:

How are you going to present this information to the user? (I know debugserver can report some performance data… Have you looked into how that works? Do you plan to reuse some parts of that infrastructure?) How will you get the information from the server to the client?

Currently I plan to show a list of the instructions that have been executed so far. I have seen the implementation suggested by Pavel; the infrastructure already present is a little lacking for the needs of this project, but I plan to follow a similar approach, i.e. to extract the raw trace data by querying the server (which can use perf_event_open to get the raw trace data from the kernel) and transport it through gdb packets (qXfer packets, https://sourceware.org/gdb/onlinedocs/gdb/Branch-Trace-Format.html#Branch-Trace-Format). On the client side the raw trace data could be passed to a python-based command that decodes it. This also eliminates the dependency on libipt, since LLDB would not decode the data itself.
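For reference, the qXfer read syntax documented by GDB looks roughly like this (annex, offset and length are illustrative; 'm' marks a partial reply and 'l' the final chunk):

client -> server: $qXfer:btrace:read:all:0,fff#xx
server -> client: $m<raw branch-trace data>#xx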

There is also the question of this third party library. Do we take a hard dependency on libipt (probably a non-starter), or only use it if it’s available (much better)?

With the above-mentioned approach LLDB would not need the library; whoever wants to use the python command would have to install it separately, but LLDB won't need it.

With the performance counters, the interface would still be perf_event_open, so if there were a perf wrapper in lldb-server it could be reused to configure and use the software performance counters as well; you would just need to pass different attributes in the perf_event_open system call. I also think the wrapper could be reused to get CoreSight information (see https://lwn.net/Articles/664236/).
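To illustrate the reuse, a minimal sketch of how only the attribute setup would differ (the constants are the standard ones from <linux/perf_event.h>; the wrapper itself is hypothetical):

#include <linux/perf_event.h>
#include <string.h>

// The same hypothetical perf wrapper could serve all these use cases;
// only the perf_event_attr contents change.
static void fill_attr(struct perf_event_attr *attr, int kind) {
  memset(attr, 0, sizeof(*attr));
  attr->size = sizeof(*attr);
  if (kind == 0) {        // CPU hardware counter
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_INSTRUCTIONS;
  } else if (kind == 1) { // OS/software counter
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_CONTEXT_SWITCHES;
  } else {                // Intel PT / CoreSight: dynamic PMU type
    // attr->type = value read from
    //   /sys/bus/event_source/devices/<pmu>/type;
  }
}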

One thing to think about is that you can actually just run an expression in the program being debugged without needing to change anything in the GDB remote server, so this can all be done via python commands and would require no changes to anything. You can run an expression to enable the buffer. LLDB supports multi-line expressions that can define their own local variables and local types, so the expression could be something like:

int perf_fd = (int)perf_event_open(...);
struct PerfData
{
    void *data;
    size_t size;
};
PerfData result = read_perf_data(perf_fd);
result

The result is then a structure that you can access from your python command (it will be an SBValue), and then you can read memory in order to get the perf data.

You can also split things up into multiple calls where you can run perf_event_open() on its own and return the file descriptor:

(int)perf_event_open(...)

This expression will return the file descriptor

Then you could allocate memory via the SBProcess:

(void *)malloc(1024);

The result of this expression will be the buffer that you use...

Then you can read 1024 bytes at a time into this newly created buffer.
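Putting these steps together on the client side, a rough sketch using the C++ SB API (the Python API mirrors it one-to-one; the expression strings and chunk size are illustrative, and the elided perf_event_open arguments are deliberately kept elided):

#include "lldb/API/SBError.h"
#include "lldb/API/SBFrame.h"
#include "lldb/API/SBProcess.h"
#include "lldb/API/SBValue.h"

// frame must belong to a stopped target, as discussed later in this thread.
void ReadPerfBuffer(lldb::SBFrame frame, lldb::SBProcess process) {
  // Step 1: open the perf event in the target (arguments elided, as above).
  lldb::SBValue fd = frame.EvaluateExpression("(int)perf_event_open(...)");

  // Step 2: allocate a scratch buffer in the target.
  lldb::SBValue buf = frame.EvaluateExpression("(void *)malloc(1024)");
  lldb::addr_t addr = buf.GetValueAsUnsigned();

  // Step 3: once an expression has copied trace data into the buffer,
  // pull it out 1024 bytes at a time with a plain memory read.
  char chunk[1024];
  lldb::SBError error;
  size_t got = process.ReadMemory(addr, chunk, sizeof(chunk), error);
  (void)fd; (void)got; // the fd would be threaded into the fill expression
}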

So a solution that is completely done in python would be very attractive.

Greg

Ok, that is one option, but one of the aims of this activity is to make the data available for use by IDEs like Android Studio or Xcode or any other that may want to display this information in their environment. Keeping that in consideration, would the completely python-based approach be useful, or would providing LLDB APIs to extract raw perf data from the target be better?

It feels to me that the python based approach could run into a dead
end fairly quickly: a) you can only access the data when the target is
stopped; b) the self-tracing means that the evaluation of these
expressions would introduce noise in the data; c) overhead of all the
extra packets(?).

So, I would be in favor of an lldb-server-based approach. I'm not telling you that you shouldn't go the python route, but it's not the approach I would take...

pl

And what about the ease of integration into an IDE? I don't really know whether the python-based approach would be usable in that context.

Speaking for Android Studio, I think that we *could* use a
python-based implementation (hard to say exactly without knowing the
details of the implementation), but I believe a different
implementation could be *easier* to integrate. Plus, if the solution
integrates more closely with lldb, we could surface some of the data
in the command-line client as well.

pl

If you want to go down the path of implementing it outside LLDB, then I would suggest implementing it as an out-of-tree plugin written in C++. You can use the SB API the same way as you can from python, and additionally it has a few advantages:

  • You have a C/C++ API, which makes it easy to integrate the functionality into an IDE (they just have to link against your shared library)
  • You can generate a Python API with SWIG if you need one, the same way we do it for the SB API
  • You don’t have to worry about making the code both Python 2.7 and Python 3.5 compatible

You can see a very simple example for implementing an out of tree C++ plugin in /examples/plugins/commands
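Following that example, the skeleton of such a plugin could look roughly like this (the command name and behaviour are of course hypothetical):

#include "lldb/API/SBCommandInterpreter.h"
#include "lldb/API/SBCommandReturnObject.h"
#include "lldb/API/SBDebugger.h"

// Hypothetical "intel-pt-start" command, modeled on
// examples/plugins/commands.
class IntelPtStartCommand : public lldb::SBCommandPluginInterface {
public:
  bool DoExecute(lldb::SBDebugger debugger, char **command,
                 lldb::SBCommandReturnObject &result) override {
    // ... configure tracing via the SB API / expressions here ...
    result.Printf("intel-pt tracing started\n");
    return true;
  }
};

namespace lldb {
bool PluginInitialize(lldb::SBDebugger debugger) {
  debugger.GetCommandInterpreter().AddCommand(
      "intel-pt-start", new IntelPtStartCommand(),
      "Start Intel PT tracing (hypothetical)");
  return true;
}
}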

Hello Pavel,
In the case of the expression evaluation approach you mentioned that:

  1. The data could be accessible only when the target is stopped. Why is that?
  2. What sort of noise were you referring to?

BR,

A Ravi Theja

Hello Pavel,
In the case of the expression evaluation approach you mentioned that:

1. The data could be accessible only when the target is stopped. Why is that?

If I understand the approach correctly, the idea is to run all perf calls as expressions in the debugger. Something like

expr perf_event_open(...)

We need to stop the target to be able to do something like that, as we
need to fiddle with its registers. I don't see any way around that...

2. What sort of noise were you referring to?

Since now all the perf calls will be expressions executed within the
context of the process being traced, they themselves will show up in
the trace. I am sure we could filter that out somehow, but it feels
like an added complication..

Does that make it any clearer?

pl

Yes, thanks for the clarification.

Hello Pavel

As per my understanding, instead of doing it by expression evaluation, if the code (to enable PT and gather the raw traces) is written on the lldb-server side, then lldb-server will still have to wait for the inferior to stop in order to encapsulate all the traces in packets and send them to the client for analysis.

Is it possible for the client to request lldb-server to send it a part of the raw traces while the inferior is still running?

- Abhishek

Hi,

This is certainly possible. The server already sends us the stdout
from the inferior this way. There is even some support for gathering
"profile data" in the client (see
GDBRemoteCommunicationClient.cpp:1286), presumably gathering data from
debugserver, as lldb-server does not send such packets. If needed, we
can send the same packets from lldb-server. Or, if these are not
suitable, we can add another kind of packet -- the protocol through
which the client and server communicate is fully under our control.

cheers,
pl

So a few questions: people seem worried about running something in the process if expressions are being used. Are you saying that if the process is on the local machine, process 1 can just open up a file descriptor to the trace data for process 2? If so, why pass this through lldb-server? I am not a big fan of making lldb-server become the conduit for a ton of information. It just isn't built for such high volumes of data coming in. It can be done, but that doesn't mean it should be. If everyone starts passing data like memory usage, CPU time, trace info, backtraces and more asynchronously through lldb-server, it will become a very crowded communication channel.

You don't need python if you want to do this using the lldb API. If your IDE is already linking against the LLDB shared library, it can just run the expressions using the public LLDB API. This is how view debugging is implemented in Xcode: it runs complex expressions that gather all data about a view and its subviews and returns all the layers in a blob of data that can be serialized by the expression, retrieved by Xcode (memory read from the process), and then de-serialized by the IDE into a format that can be used. If your IDE can access the trace data for another process, why not just read it from the IDE itself? Why get lldb-server involved? Granted, the remote debugging parts of this make an argument for including it in lldb-server. But if you go this route you need to make a base implementation for trace data that will work for any trace data, and have trace-data plug-ins that somehow know how to interpret and provide the data.

How do you say "here is a blob of trace data I just got from some process; go find me a plug-in that can parse it"? You might have to say "here is a blob of data, and it is for the intel trace-data plug-in". How are we going to know which trace data to ask for? Is the packet we send to lldb-server going to reply to "qGetTraceData" with something that says the type of data is "intel-IEEE-version-123.3.1" and the data is "xxxxxxx"? Then we would find a plug-in in LLDB for that trace data that can parse it? So you will need to think about completely abstracting the whole notion of trace data into some sensible API that gets exposed via SBProcess.
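Sketching what such an abstraction might look like (purely hypothetical names; nothing like this exists in the SB API at the time of writing):

// A possible shape for a generic trace interface exposed via SBProcess.
class SBTrace {
public:
  // e.g. "intel-pt"; lets LLDB look up a matching trace-data plug-in.
  const char *GetTraceType();
  // Copy a chunk of raw trace data gathered for one thread.
  size_t GetTraceData(void *buf, size_t size, size_t offset, lldb::tid_t tid);
};

// SBProcess would gain something like:
//   SBTrace StartTrace(SBTraceOptions &options);
//   void StopTrace(SBTrace &trace);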

So yes, there are two approaches to take. Let me know which one is the way you want to go. But I really want to avoid the GDB remote protocol's async packets becoming the conduit for a boat load of information.

Greg Clayton

Hi Greg

Please find any answers/queries inlined:

Hello Pavel,
In the case of the expression evaluation approach you mentioned that:

1. The data could be accessible only when the target is stopped. Why is that?

If I understand the approach correctly, the idea is to run all perf calls as expressions in the debugger. Something like

expr perf_event_open(…)

We need to stop the target to be able to do something like that, as we need to fiddle with its registers. I don't see any way around that…

2. What sort of noise were you referring to?

Since now all the perf calls will be expressions executed within the context of the process being traced, they themselves will show up in the trace. I am sure we could filter that out somehow, but it feels like an added complication…

Does that make it any clearer?

So a few questions: people seem worried about running something in the process if expressions are being used. Are you saying that if the process is on the local machine, process 1 can just open up a file descriptor to the trace data for process 2? If so, why pass this through lldb-server?

As you have also mentioned later in your email, irrespective of which approach we use to implement this feature, we will have to send the trace data from lldb-server to the client in the case of remote debugging. Moreover, even for local debugging, the current architecture of lldb is a client-server architecture (at least for macOS, Linux and FreeBSD) as far as I know. Hence, traces will have to be sent in the form of packets from server to client even with the expression evaluation approach.

I am not a big fan of making lldb-server become the conduit for a ton of information. It just isn't built for such high volumes of data coming in. It can be done, but that doesn't mean it should be. If everyone starts passing data like memory usage, CPU time, trace info, backtraces and more asynchronously through lldb-server, it will become a very crowded communication channel.

As per my understanding, one of the differences with the expression evaluation approach is that it rules out sending traces from server to client asynchronously (as traces can't be sent until the inferior stops). If an increased number of asynchronous packets is the concern here, then we can choose to send the trace data only synchronously (i.e. only after the inferior stops). Or can't we?

You don't need python if you want to do this using the lldb API. If your IDE is already linking against the LLDB shared library, it can just run the expressions using the public LLDB API. This is how view debugging is implemented in Xcode: it runs complex expressions that gather all data about a view and its subviews and returns all the layers in a blob of data that can be serialized by the expression, retrieved by Xcode (memory read from the process), and then de-serialized by the IDE into a format that can be used. If your IDE can access the trace data for another process, why not just read it from the IDE itself? Why get lldb-server involved? Granted, the remote debugging parts of this make an argument for including it in lldb-server. But if you go this route you need to make a base implementation for trace data that will work for any trace data, and have trace-data plug-ins that somehow know how to interpret and provide the data.

Thanks for suggesting this.

How do you say "here is a blob of trace data I just got from some process; go find me a plug-in that can parse it"? You might have to say "here is a blob of data, and it is for the intel trace-data plug-in". How are we going to know which trace data to ask for? Is the packet we send to lldb-server going to reply to "qGetTraceData" with something that says the type of data is "intel-IEEE-version-123.3.1" and the data is "xxxxxxx"? Then we would find a plug-in in LLDB for that trace data that can parse it? So you will need to think about completely abstracting the whole notion of trace data into some sensible API that gets exposed via SBProcess.

We need to think a bit more on this.

So yes, there are two approaches to take. Let me know which one is the way you want to go. But I really want to avoid the GDB remote protocol’s async packets becoming the conduit for a boat load of information.

In order to configure/start/finish the tracing feature, a lot of expression evaluations will have to be done (at least perf_event_open(), mmap() and a final close() on the event fd are the ones I know of). The main reason I am skeptical of the expression evaluation approach is the amount of extra packets to be sent to lldb-server to configure/start/finish tracing. Hence, I am more in favor of writing the code to configure/start/finish the tracing on the lldb-server side. Once we have all the traces with us and the inferior stops, we can send them synchronously over the communication channel.

Please correct me if there is something wrong with my understanding. Thanks a lot for your detailed email.




- Abhishek

Hi Everyone

I have developed a tool that enables lldb users to use the Intel(R) Processor Trace technology for debugging applications (as per the discussions in this thread). The patch is https://reviews.llvm.org/D33035.

Some highlights of this tool are:

  1. The tool is built on top of lldb. It is not a part of the liblldb shared library; it resides in the tools/intel-features folder. Anyone willing to use this feature can compile the tool (by enabling some extra flags) with cmake while building lldb.
  2. As was suggested, the trace decoding library has not been made a part of the lldb repository. It can be downloaded from the corresponding github repo.
  3. All intel-specific features are combined into a single shared library, thereby not cluttering the lldb repository with each intel-specific feature (as proposed by Pavel).

If something has changed or you have new concerns regarding this tool since the last discussion in this thread, please let me know.

- Abhishek