Symbolic Queries + LLDB

Hi all,

I'm a CS student on a Masters program at Imperial (UK) and Cristian Cadar (one of the authors of the original KLEE paper, http://www.doc.ic.ac.uk/~cristic/papers/klee-osdi-08.pdf) and I will be undertaking a project related to KLEE. One of our ideas is adding symbolic queries to LLDB, so users can not only get the value of a particular variable but also the range of values it can take.

Before choosing any particular project, I wanted to ask whether there's any external interest in adding symbolic queries to LLDB.

Thanks,
Milen

I would definitely love to see this come to LLDB.

Are you just asking if we'd like to contribute, or if we're interested
in it happening in general?

-y

Just asking whether anyone is interested - I'll be implementing all of it. The only kind of potential support needed would be some general guidance on how to implement it so that it fits nicely with the internal structure.

The deadline for taking the decision to undertake the project is on Wednesday, so I'll send another email if I do decide to go ahead with it.

M

Milen,

when you say "the range of values it can take," are you referring to adding a kind of watchpoint that records range information, so that you can perform integer range analysis etc? Or are you referring to a query that operates on all instances of a variable (say, a class member) existent at the current time? Given what KLEE does, I assume the former.

Right now, even before you start looking at the query infrastructure – or even the watchpoints, really – LLDB really needs support for keeping time-stamped metadata about variables and user interactions. Because LLDB uses editline, it gets some level of command-line history, but that's pretty much it right now. A proper metadata infrastructure could provide full history for variable values and function executions, providing a foundation for a variety of LLDB-based program analysis tools.

Adding this kind of metadata support to LLDB would be a sizable piece of work, but it could allow you to bring over versions of some KLEE-based tests. What do you think?

Sean

It sounds very interesting indeed. I would be more than willing to help give you general guidance to help the code fit into the LLDB codebase, so don't worry about not being able to get answers. We would definitely need to have some architectural discussions on what is currently in LLDB and how it is structured, and we can then figure out how and where this code would go.

Greg Clayton

Sean,

Milen,

when you say "the range of values it can take," are you referring to adding a kind of watchpoint that records range information, so that you can perform integer range analysis etc? Or are you referring to a query that operates on all instances of a variable (say, a class member) existent at the current time? Given what KLEE does, I assume the former.

Yes, I was referring to the former. As an example, the programmer would turn on range tracking for a particular variable. Let's assume execution has gone down the path if(x < 5) and the debugger stopped at a breakpoint along that execution. The programmer can then ask what the possible values of x are and it will say [int_min, 5).

I would assume that's relatively easy to do - is there anything in particular I should be aware of?

Right now, even before you start looking at the query infrastructure – or even the watchpoints, really – LLDB really needs support for keeping time-stamped metadata about variables and user interactions. Because LLDB uses editline, it gets some level of command-line history, but that's pretty much it right now. A proper metadata infrastructure could provide full history for variable values and function executions, providing a foundation for a variety of LLDB-based program analysis tools.

That sounds quite interesting. I'll be happy to implement the metadata infrastructure, time permitting. Another question - doing the range analysis does not depend on the aforementioned metadata subsystem, as far as I understand?

Adding this kind of metadata support to LLDB would be a sizable piece of work, but it could allow you to bring over versions of some KLEE-based tests. What do you think?

Sounds appealing to me. As the project would need to make measurable progress, ideally it should be possible to do the range analysis / symbolic queries as a starting point and optionally go a lot further.

Milen

Milen,

it sounds like what you're proposing could be done with Python code. One thing you'd have to be careful about is figuring out where the conditionals are and what they are testing. Are you going to pre-parse the code before it comes into LLDB? We don't have an infrastructure for getting ASTs at the moment.

If you implement this on top of LLDB using the external API, I think that will allow you to get as many results as possible without needing to go in and implement internal features. Then, later, once you're confident about what information needs to be associated with what data structures, we can discuss adding that as a general infrastructure that anyone can use.

Sean

We currently do link against much of clang and llvm, so it wouldn't be out of the question to be able to compile an AST inside the LLDB walls and do the analysis in house. The main issue we have with the clang and llvm stuff is that there is no exportable API. LLDB on the other hand does try and provide a public API, so it might be best to do some of this inside LLDB and only expose what clients would need.

Milen,

Have you thought about what API you would want for this?

Sean,

Milen,

it sounds like what you're proposing could be done with Python code.

I'd personally prefer doing it in C++ (lldb's native plugin language as far as I know). I assume that shouldn't be a problem.

One thing you'd have to be careful about is figuring out where the conditionals are and what they are testing. Are you going to pre-parse the code before it comes into LLDB? We don't have an infrastructure for getting ASTs at the moment.

What kind of representation / information does lldb currently have? Ideally, the AST would be the most useful. One unknown (to me) would be how the mapping between variable names and compiled code would be accomplished (I assume that lldb doesn't have access to llvm IR of the running program) - in any case, I assume the mapping should be a relatively trivial engineering problem. Any thoughts on that?

If you implement this on top of LLDB using the external API, I think that will allow you to get as many results as possible without needing to go in and implement internal features. Then, later, once you're confident about what information needs to be associated with what data structures, we can discuss adding that as a general infrastructure that anyone can use.

I think this sounds as the approach to take. I'm very interested into adding range information histories so that analysis tools can be built on top.

Milen

Greg,

Ideally I would need to know about the result of any comparison expressions involving the variables that are being tracked in order to compute range information.

Milen

Milen,

As far as "native plugin languages" go, Python is the language we use to implement higher-level features (like test-suites) on top of LLDB. Our plug-ins, which are (as you say) written in C++, typically provide LLDB with specific implementations of functionality, like object-file reading, or process control, that varies from platform to platform. At first glance, your feature would not appear to be a "plug-in" in that sense.

LLDB has full type information, and LLDB's expression parser can get the values of variables that are available in the current scope. LLDB (like GDB) gets the information about what variables are in scope at a particular point from the debug information provided with the program (on OS X, the DWARF debugging information). This information does not come with full ASTs, only with a mapping from assembly locations to source lines.

What you will need to implement, if you want to find conditionals, is a mechanism that discovers conditionals in the source code of a function, and then maps them back using the debugging information to assembly locations. Then, wherever you're stopped, you can see what conditions are true at the current point, and make your symbolic deductions. Discovering conditionals involves parsing, which LLVM can do for you. I would be wary of implementing source-code parsing as a general feature of LLDB, because it's difficult to determine the exact context in which a source file was compiled – i.e., the #defines that were set, etc.

It sounds like a good first step might be to sketch out the algorithm you intend to use for finding what conditions hold at the current execution point, and then seeing what parts require new API functionality.

Sean

Sean,

As far as "native plugin languages" go, Python is the language we use to implement higher-level features (like test-suites) on top of LLDB. Our plug-ins, which are (as you say) written in C++, typically provide LLDB with specific implementations of functionality, like object-file reading, or process control, that varies from platform to platform. At first glance, your feature would not appear to be a "plug-in" in that sense.

I see. In this case, the only "plug-in" parts would be any bits that need to expose additional information in order to implement the symbolic queries.

What you will need to implement, if you want to find conditionals, is a mechanism that discovers conditionals in the source code of a function, and then maps them back using the debugging information to assembly locations. Then, wherever you're stopped, you can see what conditions are true at the current point, and make your symbolic deductions. Discovering conditionals involves parsing, which LLVM can do for you. I would be wary of implementing source-code parsing as a general feature of LLDB, because it's difficult to determine the exact context in which a source file was compiled – i.e., the #defines that were set, etc.

So, from what I can tell, the general algorithm would need to:
1) find conditionals using the LLVM parser (I don't know if DWARF contains references to source code)
2) The debugging information can now map variables to registers etc

Regarding determining which conditional paths were taken whenever we are stopped - would injecting code after the branch points which will add outcome data to a data structure for the current stack frame (I'm making a lot of assumptions here) work? I know that gdb implements breakpoints this way, I think (code injection + trapping). As an example, after if(a < b) and the programming wanting to track a, we would pull the value of b from the appropriate register, record that information and continue execution.

It sounds like a good first step might be to sketch out the algorithm you intend to use for finding what conditions hold at the current execution point, and then seeing what parts require new API functionality.

Would the algorithm above work or does this not fall within the LLVM architecture?

I'm very interested in the project, so I will have a final discussion with my mentor tomorrow. My proposal would most likely include symbolic range queries and optionally a metadata framework so that we can accumulate data during runtime. In the future, range information + history of the ranges for variables could be used to develop some more advanced functionalities within lldb. Any thoughts on that?

Milen

After speaking with Daniel Dunbar and Ted Kremenek a bit, they didn't really see that value add of doing this in LLDB.

What is the information would you like to get out of the debugger that isn't available just from the AST's and the symbolic query stuff that already works with llvm/clang?

Greg

Greg,