[GSoC 2014] Using LLVM as a code-generation backend for Valgrind

Hi,

I've seen on LLVM's Open Projects page [1] an idea about using LLVM to generate native code in Valgrind. From what I know, Valgrind uses libVEX to translate native instructions into an intermediate representation, which is used to add the instrumentation and is then translated back to native code for execution.

Valgrind and LLVM are two tools that I use nearly every day. I'm also very interested in code generation and optimization, so adding the possibility to use LLVM to generate native code in libVEX interests me very much. Is it a good idea? Could an LLVM backend bring something useful to Valgrind (for instance, faster execution or more supported targets)?

I've sent this message to the LLVM and Valgrind mailing lists because I originally found the idea on the LLVM website, but Valgrind is the subject of the idea. By the way, does anyone already know if LLVM or Valgrind will be a mentoring organization for this year's GSoC?

You can find in [2] the list of my past projects. During the GSoC 2011, I had the chance to use the Clang libraries to compile C code, and the LLVM JIT to execute it (with instrumented stdlib functions). I have also played with the LLVM C bindings to generate code when I explored some parts of Mesa.

Denis Steckelmacher

[1] : http://llvm.org/OpenProjects.html#misc_new
[2] : http://steckdenis.be/page-projects.html

I work on LLVM and used to work on Dr. Memory, a memory debugger similar to Valgrind’s Memcheck. I think using LLVM as a Valgrind backend is probably feasible, but I highly doubt that LLVM will be able to generate code quickly enough to make it usable. It might be worthwhile as a second level JIT, but usually the major downside to DBI tools is the startup overhead, and LLVM code generation can only increase that.

That doesn’t mean it wouldn’t be a fun project, though. :slight_smile:

It would also be interesting to cache the LLVM-generated code between runs, or even do it as an AOT pass. A lot of the time when I have problems where Valgrind would help, I end up needing multiple runs, and each one takes several minutes of execution time (sometimes 5-10, because of the slowdown). I've also found that for some applications, Valgrind's overhead makes it impossible to reproduce some timing issues, and so the segfault just goes away if you run the code slower.

David

>> I work on LLVM and used to work on Dr. Memory, a memory debugger similar to Valgrind's Memcheck. I think using LLVM as a Valgrind backend is probably feasible, but I highly doubt that LLVM will be able to generate code quickly enough to make it usable. It might be worthwhile as a second level JIT, but usually the major downside to DBI tools is the startup overhead, and LLVM code generation can only increase that.
>
> It would also be interesting to cache the LLVM-generated code between runs, or even do it as an AOT pass. A lot of the time when I have problems where Valgrind would help, I end up needing multiple runs, and each one takes several minutes of execution time (sometimes 5-10, because of the slowdown). I've also found that for some applications, Valgrind's overhead makes it impossible to reproduce some timing issues, and so the segfault just goes away if you run the code slower.
>

That is a very interesting idea! LLVM IR is easily serialized to and deserialized from bitcode files, and these files can even be linked together. It could be quite useful if the serialization is also done in a JIT fashion, so as not to slow down the program's startup too much.
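
For illustration, here is a minimal sketch of what the save/reload side could look like with the LLVM C++ API. The function names writeCachedModule and readCachedModule are mine, and the headers shown are the recent ones (the 2014-era equivalents lived in llvm/Bitcode/ReaderWriter.h):

```cpp
#include <memory>
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/Bitcode/BitcodeWriter.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"

// Serialize an already-optimized module to an on-disk cache file.
bool writeCachedModule(const llvm::Module &M, const char *Path) {
  std::error_code EC;
  llvm::raw_fd_ostream OS(Path, EC, llvm::sys::fs::OF_None);
  if (EC)
    return false;
  llvm::WriteBitcodeToFile(M, OS);
  return true;
}

// Reload the module on a later run; the optimization passes that ran
// before serialization do not need to be re-run on the reloaded bitcode.
std::unique_ptr<llvm::Module> readCachedModule(const char *Path,
                                               llvm::LLVMContext &Ctx) {
  auto Buf = llvm::MemoryBuffer::getFile(Path);
  if (!Buf)
    return nullptr;
  auto ModOrErr = llvm::parseBitcodeFile((*Buf)->getMemBufferRef(), Ctx);
  if (!ModOrErr) {
    llvm::consumeError(ModOrErr.takeError());
    return nullptr;
  }
  return std::move(*ModOrErr);
}
```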

I think a more interesting idea would be to use LLVM to perform instrumentation and then to use Valgrind to instrument third-party libraries linked into the program.

What I'm imagining is this: let's say you instrument a program with SAFECode or ASan to find memory safety errors. When you run the program under Valgrind, the portion of the code instrumented by SAFECode or ASan runs natively, without dynamic binary instrumentation, because it has already been instrumented. When the program calls uninstrumented code (e.g., code in a dynamic library), Valgrind falls back to dynamic binary instrumentation.
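
To make it concrete, compile-time instrumentation in the ASan style boils down to inline checks like the following sketch, which run at native speed with no binary translation. The shadow offset and granularity are the usual x86-64 ASan values, and report_load_error is a hypothetical reporting hook:

```cpp
#include <cstdint>

void report_load_error(uintptr_t addr);  // hypothetical error-reporting hook

// Simplified ASan-style check for an 8-byte load: each shadow byte
// describes 8 application bytes, and 0 means fully addressable.
static const uintptr_t kShadowOffset = 0x7fff8000;  // typical x86-64 value

inline void check_load8(uintptr_t addr) {
  int8_t shadow = *reinterpret_cast<int8_t *>((addr >> 3) + kShadowOffset);
  if (shadow != 0)
    report_load_error(addr);
}
```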

A really neat thing you could do with this is to share run-time data structures between the LLVM and Valgrind instrumentation. For example, Valgrind could use SAFECode's metadata on object allocations, and vice versa.

If you're really clever, the LLVM instrumentation could be added in a way where it's off by default but enabled when the program is run under Valgrind.
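
One plausible way to implement that switch, sketched under the assumption that the compiled-in checks consult a global flag (g_checks_enabled is mine): Valgrind already ships a client-request macro, RUNNING_ON_VALGRIND, that is a cheap no-op when the program runs natively:

```cpp
#include <valgrind/valgrind.h>

// Global flag consulted by the compile-time-inserted checks (hypothetical).
bool g_checks_enabled = false;

// Flip the flag at startup when the process is running under Valgrind.
__attribute__((constructor)) static void enable_checks_under_valgrind() {
  if (RUNNING_ON_VALGRIND)
    g_checks_enabled = true;
}
```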

The net effect is that most of the instrumentation runs faster because it was added at compile time, but code compiled with another compiler can still be instrumented by Valgrind, with a performance penalty.

-- John T.

-valgrind-dev, it bounced for me
+timurrrr
+zhaoqin
+eugenis

Hi,

I've seen on LLVM's Open Projects page [1] an idea about using LLVM to
generate native code in Valgrind. From what I know, Valgrind uses libVEX
to translate native instructions into an intermediate representation,
which is used to add the instrumentation and is then translated back to
native code for execution.

I think a more interesting idea would be to use LLVM to perform
instrumentation and then to use Valgrind to instrument third-party
libraries linked into the program.

We did this with DynamoRIO, ASan, and MSan, and published the results:
http://research.google.com/pubs/pub41440.html

It's a cool idea, but we haven't been able to productionize it enough to
test Chromium yet. The code for the MSan side is actually in compiler-rt:
http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/msandr/msandr.cc?view=markup

Ultimately it may be easier (on Linux) to build new instrumented packages
for every library that you care about testing with.

Someone proposed to cache the results of a JIT compilation. Caching LLVM bitcode is easy (and the LLVM optimizations operate on bitcode, so they don't need to be re-run when the bitcode is reloaded), and may be a good way to speed up Valgrind. Caching native binary code is more difficult, and would only be useful if LLVM's codegen is slow (I think the codegen can be configured to be fast, for instance by using a simpler register allocator).
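
For reference, a sketch of that knob using the MCJIT-era EngineBuilder API (the exact signatures varied between releases of that period, and makeFastJIT is a name of my own). CodeGenOpt::None selects the fast one-pass register allocator and skips most machine-level optimization:

```cpp
#include <string>
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CodeGen.h"

// Build a JIT that favors compilation speed over code quality.
llvm::ExecutionEngine *makeFastJIT(llvm::Module *M, std::string &Err) {
  return llvm::EngineBuilder(M)
      .setErrorStr(&Err)
      .setOptLevel(llvm::CodeGenOpt::None)  // fastest code generation
      .create();
}
```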

If every .so is cached in a separate bitcode file, loading an application would only require generating bitcode for the application itself, not for the libraries it uses, provided they haven't changed since another application using them was analyzed. That may speed up Valgrind's start-up.

Valgrind is still going to be single-threaded, right?

Hi,

I did some work in libVEX a few years ago.
AFAIR, the gains from improved code generation in libVEX are barely noticeable, and better codegen may even increase the overall run-time, since the improvement in code quality may not pay off the extra time spent on analysis (well, that's probably true for every JIT compiler).
Therefore, and since an LLVM JIT would significantly increase the run-time of libVEX's codegen, I think this probably only makes sense for hot superblocks. Most BBs are irrelevant. libVEX doesn't have a profiling infrastructure to pinpoint hot BBs, AFAIR, though.
One of Valgrind's weaknesses is SIMD code. For example, OpenSSL code under Valgrind takes forever to execute. LLVM could potentially help there, since it now has (two) BB vectorizers.

I'm sure the (very friendly) Valgrind developers can give you a few more hints (and correct me if I'm wrong).
I'm clueless about this year's GSoC.

Nuno

Hi,

only one message got through to the valgrind-developers mailing list. I'll
quote the first message of the thread so that those who do not read llvmdev
know what this discussion is about.

=== Begin of the first message ===

Hi,

I've seen on LLVM's Open Projects page [1] an idea about using LLVM to
generate native code in Valgrind. From what I know, Valgrind uses libVEX
to translate native instructions into an intermediate representation,
which is used to add the instrumentation and is then translated back to
native code for execution.

Valgrind and LLVM are two tools that I use nearly every day. I'm also
very interested in code generation and optimization, so adding the
possibility to use LLVM to generate native code in libVEX interests me
very much. Is it a good idea? Could an LLVM backend bring something
useful to Valgrind (for instance, faster execution or more supported
targets)?

I've sent this message to the LLVM and Valgrind mailing lists because
I originally found the idea on the LLVM website, but Valgrind is the
subject of the idea. By the way, does anyone already know if LLVM or
Valgrind will be a mentoring organization for this year's GSoC?

You can find in [2] the list of my past projects. During the GSoC 2011,
I had the chance to use the Clang libraries to compile C code, and the
LLVM JIT to execute it (with instrumented stdlib functions). I have also
played with the LLVM C bindings to generate code when I explored some
parts of Mesa.

Denis Steckelmacher

[1] : http://llvm.org/OpenProjects.html#misc_new
[2] : http://steckdenis.be/page-projects.html

=== End of the first message ===

The idea of using an LLVM backend in dynamic binary translation (DBT)
projects has become popular recently. Unfortunately, it has not proven
to work well.

I suggest you check the related work on QEMU. The DBT parts of both QEMU
and Valgrind work in a similar way, and there have been a number of
attempts to use LLVM as a QEMU backend. They mostly resulted in
slowdowns: in [1] the authors reported a 35x slowdown, and in [2] there
was around a 2x slowdown. Finally, in [3] the authors reported a
performance gain, but there are some catches.

1. They used LLVM not only as a backend: they replaced the internal
representation with LLVM IR. This is not an option for Valgrind, because
you'd need to rewrite all existing tools (including third-party ones)
to do it.

2. They used SPEC CPU benchmarks to measure their speedup. The important
thing about these tests is that they have little code to translate but
a lot of computation to do with the translated code. Even so, some of
these tests do not fare too well (like 403.gcc). On real-life
applications (like Firefox), where there is a lot of library code to
translate and not so much computation to do, the results may be totally
different.

LLVM does not do well as a DBT backend, mostly for two reasons.

First, in DBT you need to translate while the application is running,
and you need to do it really fast. A compiler is not optimized for that
task. The LLVM JIT? Maybe.

Second, in DBT you translate code in small portions, like basic blocks
or extended basic blocks. These have a very simple structure: there are
no loops, and there is none of the redundancy that comes from
translating a high-level language to a low-level one. There is nothing
that sophisticated optimizations can do better than very simple ones.

In conclusion, I second what has already been said: this project sounds
like fun to do, but do not expect many practical results from it.

It would also be interesting to cache the LLVM-generated code
between runs

The tricky part here is to build a mapping between binary code fragments
and cached translations from previous runs. In the worst case, all you
know about the binary code is its address (which can vary between runs)
and the code bytes themselves.

[1] : "Dynamically Translating x86 to LLVM using QEMU"
http://infoscience.epfl.ch/record/149975/files/x86-llvm-translator-chipounov_2.pdf
[2] : llvm-qemu project.
http://code.google.com/p/llvm-qemu/
[3] : "LnQ: Building High Performance Dynamic Binary Translator
       with Existing Compiler Backends"
http://people.cs.nctu.edu.tw/~chenwj/slide/paper/lnq.pdf

I tend to agree with Kirill. It would be great to make Valgrind/Memcheck
faster, and there are certainly ways to do that, but using LLVM is not
one of them.

Second, in DBT you translate code in small portions, like basic blocks
or extended basic blocks. These have a very simple structure: there are
no loops, and there is none of the redundancy that comes from
translating a high-level language to a low-level one. There is nothing
that sophisticated optimizations can do better than very simple ones.

Yes. One of the problems of the "Let's use LLVM and it'll all go much
faster" concept is that it lacks a careful analysis of what makes Valgrind
(and QEMU, probably) run slowly in the first place.

As Kirill says, the short blocks of code that V generates make it
impossible for LLVM to do sophisticated loop optimisations etc.
Given what Valgrind's JIT has to work with -- straight line pieces
of code -- it generally does a not-bad job of instruction selection
and register allocation, and I wouldn't expect that substituting LLVM's
implementation thereof would make much of a difference.

What would make Valgrind faster is:

(1) improve the caching of guest registers in host registers across
    basic block boundaries. Currently all guest registers cached in
    host registers are flushed back into memory at block boundaries,
    and no host register holds any live value across the boundary.
    This is simple but very suboptimal, creating large amounts of
    memory traffic.

(2) improve the way that the guest program counter is represented.
    Currently it is updated before every memory access, so that if an
    unwind is required, it is possible. But this again causes lots of
    excess memory traffic. This is closely related to (1).

(3) add some level of control-flow if-then-else support to the IR, so
    that the fast-case paths for the memcheck helper functions
    (helperc_LOADV64le etc) can be generated inline.

(4) Redesign Memcheck's shadow memory implementation to use a 1-level
    map rather than 2 levels as at present. Or something more
    TLB-like. (A rough sketch of the difference follows this list.)
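
As an illustration of (4), here is roughly what the two schemes look like as loads. The constants and field names are made up for the sketch and do not match Memcheck's actual source:

```cpp
#include <cstdint>

struct SecMap { uint8_t vabits[16384]; };  // hypothetical secondary map
static SecMap *primary_map[65536];         // hypothetical primary map
static uint8_t *shadow_base;               // hypothetical flat shadow table

// 2-level scheme (roughly the current one): two dependent loads.
inline uint8_t shadow2(uintptr_t a) {
  SecMap *sm = primary_map[a >> 16];       // first load
  return sm->vabits[(a & 0xffff) >> 2];    // second, dependent load
}

// 1-level scheme: a single load off a fixed base.
inline uint8_t shadow1(uintptr_t a) {
  return shadow_base[a >> 2];
}
```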

I suspect that the combination of (1) and (2) causes processor write
buffers to fill up and start stalling, although I don't have numbers
to prove that. What _is_ very obvious from profiling Memcheck using
Cachegrind is that the generated code contains a much higher proportion
of memory references than "normal integer code". And in particular
it contains perhaps 4 times as many stores as "normal integer code".
Which can't be a good thing.

(3) is a big exercise -- much work -- but potentially very beneficial.
(4) is also important if only because we need a multithreaded
implementation of Memcheck. (1) and (2) are smaller projects and would
constitute a refinement of the existing code generation framework.

In conclusion, I second what has already been said: this project sounds
like fun to do, but do not expect many practical results from it.

The above projects (1) .. (4) would also be fun :slight_smile: and might generate more
immediate speedups for Valgrind.

J

For (3), would something like making all statements conditional (like LoadG, StoreG, and Exit are) do, or are we talking about something more complex?

Something more complex: being able to add control-flow diamonds
(if-then-else-merge) into the IR. Then, for example, the load-cases
for Memcheck could be put inline: fast/slow case check before the
diamond, the fast case code in the then branch, the slow case code
calling a helper in the else branch.
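
In C terms, the diamond being described would let the generated code look like this sketch. helperc_LOADV64le is the real helper named above; fastCaseApplies and inlineShadowLookup64 are hypothetical stand-ins for the inlined fast-path logic:

```cpp
#include <cstdint>

extern "C" uint64_t helperc_LOADV64le(uint64_t addr);  // slow-path helper

bool fastCaseApplies(uint64_t addr);            // hypothetical inline check
uint64_t inlineShadowLookup64(uint64_t addr);   // hypothetical fast path

inline uint64_t loadV64(uint64_t addr) {
  if (fastCaseApplies(addr))             // check before the diamond
    return inlineShadowLookup64(addr);   // "then": inlined fast case
  return helperc_LOADV64le(addr);        // "else": call out to the helper
}
```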

Doing control flow diamonds in IR means that both the IR optimiser
and the register allocator will have to deal with control flow
merges, which they don't at present. That would make them more complex,
although not as complex as they would be if they had to deal with
loops as well.

J