Adding support for self-modifying branches to LLVM?

Hi,

I’m thinking about using LLVM to implement a limited form of self-modifying code. Before diving into that, I’d like to get some feedback from you all.

The goal: I’d like to add “optional” code to a program that I can enable at runtime and that has zero (i.e., as close to zero as I can get) overhead when not enabled.

Existing solutions: Currently, I can guard optional code using a branch, something like br i1 %cond, label %optional, label %skip, !prof !0. Branch weights ensure that the branch is predicted correctly. The overhead of this is not as low as I’d like, though, because the branch is still present in the code and because computing %cond also has some cost.
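For concreteness, a minimal sketch of such a guard in IR might look like the following (the helper function and the exact weights are just placeholders I made up):

    ; Optional code guarded by a branch annotated as almost never taken.
    ; %cond comes from wherever the enable flag lives.
    define void @guarded(i1 %cond) {
    entry:
      br i1 %cond, label %optional, label %skip, !prof !0

    optional:                                   ; the "optional" code
      call void @do_optional_work()
      br label %skip

    skip:
      ret void
    }

    declare void @do_optional_work()

    ; branch weights: %optional is essentially never taken, %skip always is
    !0 = !{!"branch_weights", i32 1, i32 100000}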

The idea: I’d like to have a branch that is the same as the example above, but that gets translated into a nop instruction, preferably a unique nop that I can easily recognize in the binary and that has the same size as an unconditional branch instruction. Then, I could use a framework such as DynInst to replace that nop with an unconditional branch instruction at run-time.

My questions to the community would be:

  • Does the idea make sense, or am I missing a much simpler approach?
  • What would be the easiest way to obtain the desired binary? Adding a new TerminatorInstruction sounds daunting; is there something simpler?

I also wonder whether I could even expect speedups from this. Are nop instructions actually cheaper than branches? Would modifying the binary at run-time play well enough with caches etc.? These are probably not questions for the LLVM mailing list, but if anybody has good answers, they are welcome.

Looking forward to hearing your thoughts,
Jonas

From: "Jonas Wagner via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Tuesday, January 19, 2016 11:40:16 AM
Subject: [llvm-dev] Adding support for self-modifying branches to
LLVM?

Hi,
I’m thinking about using LLVM to implement a limited form of
self-modifying code. Before diving into that, I’d like to get some
feedback from you all.
The goal: I’d like to add “optional” code to a program that I can
enable at runtime and that has zero (i.e., as close to zero as I can
get) overhead when not enabled.
Existing solutions: Currently, I can guard optional code using a
branch, something like br i1 %cond, label %optional, label %skip,
!prof !0 . Branch weights ensure that the branch is predicted
correctly. The overhead of this is not as low as I’d like, though,
because the branch is still present in the code and because
computing %cond also has some cost.
The idea: I’d like to have a branch that is the same as the example
above, but that gets translated into a nop instruction. Preferably
some unique nop that I can easily recognize in the binary, and that
has the same size as an unconditional branch instruction. Then, I
could use a framework such as DynInst to replace that nop with an
unconditional branch instruction at run-time.
My questions to the community would be:

* Does the idea make sense, or am I missing a much simpler approach?
* What would be the easiest way to obtain the desired binary? Adding
a new TerminatorInstruction sounds daunting, is there something
simpler?

I also wonder whether I could even expects speedups from this? Are
nop instructions actually cheaper than branches? Would modifying the
binary at run-time play well enough with caches etc.? These are
probably not questions for the LLVM mailing list, but if anybody has
good answers they are welcome.

If you've not already, you'll want to look at this: http://llvm.org/docs/StackMaps.html (it does not quite do what you want, but it should give you some idea of how you might proceed).

-Hal

> Existing solutions: Currently, I can guard optional code using a branch, something like br i1 %cond, label %optional, label %skip, !prof !0. The overhead of this is not as low as I’d like.

How low would you like it? What use case is suffering performance-wise because of the branch?

Self-modifying code for truly zero-overhead (when not enabled) instrumentation is a real thing (look at e.g. the DTrace pid provider), but unless the number of instrumentation points is very large (hundreds of thousands? millions?) or not known beforehand (both are true for DTrace), the cost of a branch will be negligible.

AFAIK, the cost of a well-predicted, not-taken branch is the same as a nop on every x86 made in the last many years; see http://www.agner.org/optimize/instruction_tables.pdf. Generally speaking, a correctly-predicted not-taken branch is basically identical to a nop, and a correctly-predicted taken branch has an extra overhead similar to an "add" or other extremely cheap operation. More concerning is that the condition being branched on is probably some flag in memory somewhere and will require a memory operation to check it (but of course on a good OoO core with speculative execution this doesn't hold up anything but the retire queue).

-- Sean Silva

Thanks for the information. This has been very useful!

Patch points indeed almost do what I need. I will try to build a similar solution.

In the use case that I have in mind, there are indeed a large number of instrumentation points. To give a concrete example of what I’d like to achieve, consider Clang’s -fsanitize flag. These sanitizers add thousands of little independent bits of code to the program, e.g., memory access checks. This code is very useful but also slows the program down. I’d like to transform it into zero-overhead instrumentation that I can enable selectively.

I’m still not 100% sure whether the nop <-> br conversion is the best approach. I’ve considered some alternatives:

  • Branches where the condition is a flag in memory. My early experiments were too slow :frowning: (see the sketch after this list)
  • Conditional branches along with some other way to control the flags register. I’m afraid of the side effects this might have.
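To illustrate the first alternative, here is a rough sketch of a flag-in-memory guard in IR; the global name and weights are made up, and the point is that every site pays for a load and a compare even when the checks are disabled:

    ; Flag-in-memory variant: each check site loads a byte, compares, and
    ; branches. The branch predicts well, but the load and compare remain.
    @check_enabled = global i8 0

    define void @flag_guarded_site() {
    entry:
      %flag = load i8, i8* @check_enabled
      %cond = icmp ne i8 %flag, 0
      br i1 %cond, label %optional, label %skip, !prof !0

    optional:
      call void @optional_check()
      br label %skip

    skip:
      ret void
    }

    declare void @optional_check()

    !0 = !{!"branch_weights", i32 1, i32 100000}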

For now, transforming an llvm.experimental.patchpoint into an unconditional branch looks most promising. I just need to figure out how to trick LLVM into laying out basic blocks the right way (and not eliminating those that are unreachable due to not-yet-transformed branches).
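A minimal sketch of what such a patchpoint placeholder might look like, going by the StackMaps documentation (the ID and byte count are arbitrary; as I understand it, a null call target means no call is emitted, so the patchpoint just reserves nop space that can be rewritten later):

    ; Reserve 5 bytes of patchable nops at a check site.
    ; i64 42 is an arbitrary ID used to find the site again at run-time.
    declare void @llvm.experimental.patchpoint.void(i64, i32, i8*, i32, ...)

    define void @check_site() {
    entry:
      call void (i64, i32, i8*, i32, ...) @llvm.experimental.patchpoint.void(i64 42, i32 5, i8* null, i32 0)
      ret void
    }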

Cheers,
Jonas

> In the use case that I have in mind, there are indeed a large number of instrumentation points. To give a concrete example of what I’d like to achieve, consider Clang’s -fsanitize flag. [...] I’d like to transform it into zero-overhead instrumentation that I can enable selectively.

Your initial idea (compile with branches, then convert to nops) won't truly be zero-overhead, because that still requires having the instrumentation present in the IR, which inhibits optimization. Actually, for UBSan I think this is the primary mechanism by which it "slows code down".

Also, sanitizers that use shadow memory really "want" instrumentation to be enabled everywhere. E.g., for ASan, if one part of the code is not instrumented, it will fail to detect a bug that it otherwise would have found (a false negative). For MSan it is even worse, because if any part of the code is not instrumented, it will give false positives. For example, just compiling a single function with ASan/MSan instrumentation isn't very useful: that function probably uses memory from other parts of the program, and without that other code being instrumented the checks aren't particularly useful.

> For now, transforming an llvm.experimental.patchpoint into an unconditional branch looks most promising. I just need to figure out how to trick LLVM into laying out basic blocks the right way (and not eliminating those that are unreachable due to not-yet-transformed branches).

Using patchpoint et al. seems like a much better fit for your use case (as I've understood it).

-- Sean Silva

Hi,

> Your initial idea (compile with branches, then convert to nops) won't truly be zero-overhead, because that still requires having the instrumentation present in the IR, which inhibits optimization.

This is very interesting! Do you know of any studies that measure this kind of effect?

> Also, sanitizers that use shadow memory really "want" instrumentation to be enabled everywhere.

There is some data on this, e.g., in “High System-Code Security with Low Overhead” (http://dslab.epfl.ch/proj/asap/#publications). In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.

I’m happy to discuss this further. Also, if there are more suggestions on how to best implement this, let me know!

- Jonas

> AFAIK, the cost of a well-predicted, not-taken branch is the same as a nop on every x86 made in the last many years. [...] a correctly-predicted taken branch has an extra overhead similar to an "add" or other extremely cheap operation.

Specifically on this point only: while absolutely true for most micro-benchmarks, this is less true at large scale. I’ve definitely seen removing a highly predictable branch (in many, many places, some of which are hot) benefit performance in the 5-10% range. For instance, removing highly predictable branches is the primary motivation of implicit null checking (http://llvm.org/docs/FaultMaps.html). Where exactly the performance improvement comes from is hard to say, but, empirically, it does matter.

(Caveat to the above: I have not run an experiment that actually put in the same number of bytes in nops. It’s possible the entire benefit I mentioned is code-size related, but I doubt it given how many ticks a sample profiler will show on said branches.)

p.s. Sean mentions down-thread that most of the slowdown from checks is in the effect on the optimizer, not the direct impact of the instructions emitted. This is absolutely our experience as well. I don’t intend for anything I said above to imply otherwise.

Philip

> This is very interesting! Do you know of any studies that measure this kind of effect?

No (but I haven't looked).

> There is some data on this, e.g., in “High System-Code Security with Low Overhead”. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks.

How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.

-- Sean Silva

> I’ve definitely seen removing a highly predictable branch (in many, many places, some of which are hot) benefit performance in the 5-10% range. [...] Where exactly the performance improvement comes from is hard to say, but, empirically, it does matter.

Interesting. Another possible explanation is that these extra branches cause contention on branch-prediction resources. In the past, when talking with Dan about WebAssembly sandboxing, IIRC he said that they found about 15% overhead, due primarily to branch-prediction resource contention. In fact, I think they had a pretty clear idea of wanting a new instruction which is just a "statically predict never taken and don't use any branch-prediction resources" branch (this is on x86 IIRC; some arches obviously do have such an instruction!).
-- Sean Silva

> Another possible explanation is that these extra branches cause contention on branch-prediction resources.

I’ve heard and proposed this explanation in the past as well, but I’ve never heard of anyone able to categorically answer the question. The other explanation I’ve considered is that the processor has a finite speculation depth (i.e., how many predicted branches can be in flight), and the extra branches cause the processor to not be able to speculate “interesting” branches because they’re full of uninteresting ones. However, my hardware friends tell me this is a somewhat questionable explanation, since the check branches should be easy to satisfy and retire quickly. 15% seems a bit high to me, but I don’t have anything concrete to share here, unfortunately.

> In fact, I think they had a pretty clear idea of wanting a new instruction which is just a "statically predict never taken and don't use any branch-prediction resources" branch.

This has been on my wish list for a while. It would make many things so much easier. The sickly amusing bit is that x86 has two different forms of this, neither of which actually works:

1) There are prefixes for branches which are supposed to control the prediction direction. My understanding is that code which tried using them was so often wrong that modern chips interpret them as nop padding. We actually use this to produce near-arbitrary-length nops. :slight_smile:

2) x86 (but not x86-64) had an “into” instruction which triggered an interrupt if the overflow bit is set. (Hey, signal handlers are just weird branches, right? :p) However, this does not work as designed in x86-64. My understanding is that the original AMD implementation had a bug in this instruction, and the bug essentially got written into the spec for all future chips. :frowning:

Philip

Hello,

> > There is some data on this, e.g., in “High System-Code Security with Low Overhead”. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.

Interesting. Indeed this was measured by removing some checks and then re-optimizing the program.

I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.

On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.

I’ll try to measure this, and will come back when I have data.

Best,
Jonas

> I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples?

Off the top of my head:

- The CFG is more complex.
- Regalloc is harder (although in theory it needn't do a worse job, since the branches are marked as unlikely; I'm not sure about in practice).
- The branch splits a BB which otherwise would have stayed together, restricting e.g. the scheduler (which is BB-at-a-time, last I heard).

-- Sean Silva

From: "Sean Silva via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Jonas Wagner" <jonas.wagner@epfl.ch>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Thursday, January 21, 2016 6:53:18 PM
Subject: Re: [llvm-dev] Adding support for self-modifying branches to
LLVM?

> Hello,

> > > There is some data on this, e.g, in “High System-Code Security
> > > with
> > > Low Overhead” . In this work we found that, for ASan as well as
> > > other instrumentation tools, most overhead comes from the
> > > checks.
> > > Especially for CPU-intensive applications, the cost of
> > > maintaining
> > > shadow memory is small.
> >
>

> > How did you measure this? If it was measured by removing the
> > checks
> > before optimization happens, then what you may have been
> > measuring
> > is not the execution overhead of the branches (which is what
> > would
> > be eliminated by nop’ing them out) but the effect on the
> > optimizer.
>

> Interesting. Indeed this was measured by removing some checks and
> then re-optimizing the program.

> I’m aware of some impact checks may have on optimization. For
> example, I’ve seen cases where much less inlining happens because
> functions with checks are larger. Do you know other concrete
> examples? This is definitely something I’ll have to be careful
> about. Philip Reames confirms this, too.

Off the top of my head:

- CFG is more complex
- regalloc is harder (although in theory it needn't do a worse job
since the branches are marked as unlikely; but not sure about in
practice)
- the branch splits a BB which otherwise would have stayed together,
restricting e.g. the scheduler (which is BB-at-a-time last I heard)

Yes, the scheduling is per-BB (it is also interrupted by function calls). :frowning:

-Hal

> Yes, the scheduling is per-BB (it is also interrupted by function calls). :frowning:

:frowning:

-- Sean Silva

Hi,

I’m coming back to this old thread with data about the performance of NOPs. Recall that I was considering transforming NOP instructions into branches and back, in order to dynamically enable code. One use case for this was enabling/disabling individual sanitizer checks (ASan, UBSan) on demand.

I wrote a pass which takes an ASan-instrumented program and replaces each ASan check with an llvm.experimental.patchpoint intrinsic. This intrinsic inserts a NOP of configurable size and otherwise has no effect on the program’s semantics. It does prevent some optimizations, though, presumably because instructions cannot be moved across the patchpoint.

Some results:

  • On SPEC, patchpoints introduce an overhead of ~25% compared to a version where ASan checks are removed.
  • This is almost half of the cost of the checks themselves.
  • The results are similar for NOPs of size 1 and 5 bytes.
  • Interestingly, the results are similar for NOPs of 0 bytes, too. These are patchpoints that don’t insert any code and only inhibit optimizations. I’ve only tested this on one benchmark, though.

To summarize, only part of the cost of NOPs is due to executing them. Their effect on optimizations is significant, too. I guess this would hold for branches and sanitizer checks as well.

Best,
Jonas

I don’t think you can really draw strong conclusions from the experiments you described. What you’ve ended up measuring is nearly the impact of not optimizing over patchpoints at the check locations. This doesn’t really tell you much about what a check (which is likely to inhibit optimization much less) costs over a nop at the same position.

One bit of data you could extract from the experiment as constructed would be the relative cost of extra nops. You do mention that the results are similar for sizes 1-5 bytes, but “similar” is very vague in this context. Are the results statistically indistinguishable? Or is there a noticeable but small slowdown? (Numbers would be great here.)

> One bit of data you could extract from the experiment as constructed would be the relative cost of extra nops. [...] Are the results statistically indistinguishable? Or is there a noticeable but small slowdown? (Numbers would be great here.)

In this same vein, try inserting 1, 2, 3, 4, 5, 6, ... nops and measure the performance impact (the total size of the nops is also interesting, but is more difficult to measure reliably). I've used this kind of technique successfully in the past, e.g., for measuring the cost of "stat" syscalls on Windows. I call the technique "stuffing": basically, make a plot of the performance degradation as you insert more and more redundant stuff (e.g. 1 nop, 2 nops, 3 nops, etc.). If the result is a strong linear trend, then you can pretty confidently extrapolate backward to the "0 nop" case to see the overhead of inserting 1 nop.
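For illustration only, one simple way to stuff a controlled number of nops at a site is inline asm in the IR (just a sketch; the function name and the count of three nops are arbitrary):

    ; A probe site stuffed with three nops via inline asm.
    ; "sideeffect" keeps the asm from being dropped, so exactly these
    ; nops end up in the emitted code.
    define void @probe_site() {
    entry:
      call void asm sideeffect "nop\0Anop\0Anop", ""()
      ret void
    }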

-- Sean Silva