[RFC] A new intrinsic, `llvm.blackbox`, to explicitly prevent constprop, die, etc optimizations

Hey all,

I’d like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc.

Background/Motivation

In Rust we have a crate called test which provides a function, black_box, which is designed to be a no-op function that prevents constprop, die, etc from interfering with tests/benchmarks but otherwise doesn’t negatively affect resulting machine code quality. test currently implements this function by using inline asm, which marks a pointer to the argument as used by the assembly.

At the IR level, this creates an alloca, stores its argument to it, calls the no-op inline asm with a pointer to the alloca, and then returns a load of the alloca. Obviously, mem2reg would normally optimize this sort of pattern away; however, the deliberate use of the no-op asm prevents other desirable optimizations (such as the aforementioned mem2reg pass) a little too well.

Existing and upcoming virtual ISA targets also don’t have this luxury (PNaCl/JS and WebAssembly, respectively). For these kinds of targets, Rust’s test currently forbids inlining of black_box, which crudely achieves the same effect. This is undesirable for any target because of the associated call overhead.

The IR for test::black_box::<i32> is currently (it gets inlined, as desired, so I’ve omitted the function signature):

%dummy.i = alloca i32, align 4
%2 = bitcast i32* %dummy.i to i8*
call void @llvm.lifetime.start(i64 4, i8* %2) #1
; Here, the value operand was the original argument to `test::black_box::<i32>`
store i32 2, i32* %dummy.i, align 4
call void asm "", "r,~{dirflag},~{fpsr},~{flags}"(i32* %dummy.i) #1, !srcloc !0
%3 = load i32, i32* %dummy.i, align 4
call void @llvm.lifetime.end(i64 4, i8* %2) #1

This could be better.

Solution

Add a new intrinsic, called llvm.blackbox, which accepts a value of any type and returns a value of the same type. As with many other intrinsics, this intrinsic shall remain unknown to all optimizations, before and during codegen. Specifically, this intrinsic should prevent all optimizations which operate by assuming properties of the value passed to the intrinsic. Once the last optimization pass (of any kind) has finished, each call can be replaced (RAUW’d) with its argument.

TableGen def:

def int_blackbox : Intrinsic<[llvm_any_ty], [LLVMMatchType<0>]>;

Thus, using the previous example, %3 would become:

%3 = call i32 @llvm.blackbox.i32(i32 2)

Why does this need to be an intrinsic (as opposed to a generic "unknown function" to LLVM)?

Secondly, have you looked into a volatile store / load to an alloca? That should work with PNaCl and WebAssembly.

E.g.

define i32 @blackbox(i32 %arg) {
entry:
  %p = alloca i32
  store volatile i32 10, i32* %p ;; or store %arg
  %v = load volatile i32, i32* %p
  ret i32 %v
}

-- Sanjoy

That volatility would have a negative performance impact.

Richard Diamond

I’m very unclear on why you think a generic black box intrinsic will have any different performance impact :wink:

I’m also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the disease.

IE you say "

I’d like to propose a new intrinsic for use in preventing optimizations from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?
Benchmarks that can be const prop’d/etc away are often meaningless. Past that, if you want to ensure a particular optimization does a particular thing on a benchmark, ISTM it would be better to generate the IR, run opt (or build your own pass-by-pass harness), and then run “the passes you want on it” instead of “trying to stop certain passes from doing things to it”.

This would not prevent dead code elimination from removing it. The intrinsic would need to have some sort of a side-effect in order to be preserved in all cases. Are you concerned about cases where the user of the intrinsic is dead?

-Krzysztof

To add on to what Danny and Krzysztof have said, this proposal doesn’t make a lot of sense to me. You want this intrinsic to inhibit (some) optimizations, but you simultaneously want it not to have a performance impact. Those are contradictory goals. Worse, the proposal doesn’t specify what optimizations should/should not be allowed for this intrinsic, since apparently you want at least some applied. Is CSE allowed? DCE? PRE?

—Owen

I'm very unclear on why you think a generic black box intrinsic will
have any different performance impact :wink:

I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the
disease.

IE you say "

I'd like to propose a new intrinsic for use in preventing optimizations
from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

Benchmarks that can be const prop'd/etc away are often meaningless.

A benchmark that's completely removed is even more meaningless, and the
developer may not even know it's happening. I'm not saying this intrinsic
will make all benchmarks meaningful (and I can't), I'm saying that it would
be useful in Rust in ensuring that tests/benches aren't invalidated simply
because a computation wasn't performed.

Past that, if you want to ensure a particular optimization does a

particular thing on a benchmark, ISTM it would be better to generate the
IR, run opt (or build your own pass-by-pass harness), and then run "the
passes you want on it" instead of "trying to stop certain passes from doing
things to it".

True, but why would you want to force that speed bump onto other
developers? I'd argue that's more hacky than the inline asm.

I apologize for the confusion. I don't think the goals are contradictory.
We're talking about code the developer *specifically* doesn't want
optimized away, but otherwise doesn't care about what optimization
transforms are employed. So yes, I want it to inhibit some optimizations,
but without otherwise having a performance impact outside of the obviously
prevented optimizations.

PRE would be fine, as long as the expression in question doesn't make a
call to this intrinsic.

I don’t see how this is any different from volatile markers on loads/stores or memory barriers or several other optimizer blocking devices. They generally end up crippling the optimizers without much added benefit.

Would it be possible to stop the code motion you want to block by explicitly exposing data dependencies? Or simply disabling some optimizations with pragmas?

Diego.

I'm very unclear on why you think a generic black box intrinsic will
have any different performance impact :wink:

I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the
disease.

IE you say "

I'd like to propose a new intrinsic for use in preventing optimizations
from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

For what, exact, well defined goal?

Trying to make certain specific optimizations not work does not seem like a
goal unto itself.
It's a thing you are doing to achieve something else, right?
(Because if not, it has a very well defined and well supported solution -
set up a pass manager that runs the passes you want)

What is the something else?

IE what is the problem that led you to consider this solution.

Benchmarks that can be const prop'd/etc away are often meaningless.

A benchmark that's completely removed is even more meaningless, and the
developer may not even know it's happening.

Write good benchmarks?

No, seriously, i mean, you want benchmarks that test what users will see
when the compiler works, not benchmarks that test what users would see if
they were to suddenly turn off parts of the optimizers :wink:

I'm not saying this intrinsic will make all benchmarks meaningful (and I
can't), I'm saying that it would be useful in Rust in ensuring that
tests/benches aren't invalidated simply because a computation wasn't
performed.

Past that, if you want to ensure a particular optimization does a

particular thing on a benchmark, ISTM it would be better to generate the
IR, run opt (or build your own pass-by-pass harness), and then run "the
passes you want on it" instead of "trying to stop certain passes from doing
things to it".

True, but why would you want to force that speed bump onto other
developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?

It's a completely normal test harness?

That's in fact, why llvm uses it as a test harness?

I guess i don't see why an intrinsic with not well defined semantics, used
in weird ways to try to outsmart some but not all optimizations, is "less
hacky" than a harness that says "hey, i want to see the effects of running
mem2reg and code gen on this, without running constprop. So i'm just going
to run mem2reg and codegen on this, and see the results!".
Because the former is just a way to try to magic the compiler, and the
second expresses exactly what you want.

The common use case I’ve seen for a black box like construct is when writing microbenchmarks. In particular, you’re generally looking for a way to “sink” the result of a computation without having that sink outweigh the cost of the thing you’re trying to measure.

Common alternate approaches are to use a volatile store (so that it can’t be eliminated or sunk out of loops) or a call to an external function with a cheap calling convention.

As an example:
int a = 5; // initialization is not visible to compiler
int b = 7;
void add_two_globals() {
  sink(a+b);
}

If what I’m looking into is the code generation around addition, this is a very useful way of testing the entire compiler - frontend, middle end, and backend.

I’ll note that we use such a framework extensively.

What I’m not clear on is why this needs to be an intrinsic. Why does a call to an external function or a volatile store not suffice?

Philip

I wonder the same. Richard, maybe we just need something more specific
in Rust? Like something that only clobbers memory? Or just the
variable? Seems like we could do with specialized `black_box`
functions. AIUI our `black_box` got extended to prevent more
optimizations as it became obvious that the compiler could still
"defeat" it. Maybe we need to take a step back and say "ok, we'll have
to think a bit harder and decide how hard we'll be on the optimizer"?
Is there anything that speaks against that and requires an intrinsic?

Cheers,
Björn

I don't see how this is any different from volatile markers on
loads/stores or memory barriers or several other optimizer blocking
devices. They generally end up crippling the optimizers without much added
benefit.

Volatile must touch memory (right?). Memory is slow.

Would it be possible to stop the code motion you want to block by
explicitly exposing data dependencies? Or simply disabling some
optimizations with pragmas?

Code motion would be fine in theory, though as has been proposed, this
intrinsic would prevent it (because there isn't an attribute that doesn't
allow dead code removal but still permits reordering, as far as I'm aware).
Rust doesn't have pragmas, and besides, that would also affect the whole
module (or the whole crate, to use Rust's vernacular), whereas this
intrinsic would be used in a much more targeted manner (ie at the SSA value
level) by the developer and leave the rest of the module unmolested.

I'm very unclear on why you think a generic black box intrinsic will
have any different performance impact :wink:

I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the
disease.

IE you say "

I'd like to propose a new intrinsic for use in preventing optimizations
from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

For what, exact, well defined goal?

Trying to make certain specific optimizations not work does not seem like
a goal unto itself.
It's a thing you are doing to achieve something else, right?
(Because if not, it has a very well defined and well supported solution -
set up a pass manager that runs the passes you want)

What is the something else?

IE what is the problem that led you to consider this solution.

I apologize if I'm not being clear enough. This contrived example

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        (0..1000).fold(0, |old, new| old ^ new);
    });
}

is completely optimized away. Granted, IRL production (ignoring the
question of why this code was ever used in production in the first place)
this optimization is desired, but here it leads to bogus measurements (ie
0ns per iteration). By using `test::black_box`, one would have

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(1000);  // optional
        test::black_box((0..n).fold(0, |old, new| old ^ new));
    });
}

and the microbenchmark wouldn't have bogus 0ns measurements anymore.

Now, as I stated in the proposal, `test::black_box` currently uses no-op
inline asm to "read" from its argument in a way the optimizations can't
see. Conceptually, this seems like something that should be modelled in
LLVM's IR rather than by hacks higher up the IR food chain because the root
problem is caused by LLVM's optimization passes (most of the time this code
optimization is desired, just not here). Plus, it seems others have used
other tricks to achieve similar effects (ie volatile), so why shouldn't
there be something to model this behaviour?

Benchmarks that can be const prop'd/etc away are often meaningless.

A benchmark that's completely removed is even more meaningless, and the
developer may not even know it's happening.

Write good benchmarks?

No, seriously, i mean, you want benchmarks that test what users will see
when the compiler works, not benchmarks that test what users would see if
they were to suddenly turn off parts of the optimizers :wink:

But users are also not testing how fast deterministic code which LLVM is
completely removing can go. This intrinsic prevents LLVM from correctly
thinking the code is deterministic (or that a value isn't used) so that
measurements are (at the very least, the tiniest bit) meaningful.

I'm not saying this intrinsic will make all benchmarks meaningful (and I

can't), I'm saying that it would be useful in Rust in ensuring that
tests/benches aren't invalidated simply because a computation wasn't
performed.

Past that, if you want to ensure a particular optimization does a

particular thing on a benchmark, ISTM it would be better to generate the
IR, run opt (or build your own pass-by-pass harness), and then run "the
passes you want on it" instead of "trying to stop certain passes from doing
things to it".

True, but why would you want to force that speed bump onto other
developers? I'd argue that's more hacky than the inline asm.

Speed bump? Hacky?

It's a completely normal test harness?

That's in fact, why llvm uses it as a test harness?

I mean I wouldn't write a harness or some other type of workaround for
something like this: Rust doesn't seem to be the first to have encountered
this issue, thus it is nonsensical to require every project using LLVM to
have a separate harness or other workaround so they don't run into this
issue. LLVM's own documentation suggests that adding an intrinsic is the
best choice moving forward anyway: "Adding an intrinsic function is far
easier than adding an instruction, and is transparent to optimization
passes. If your added functionality can be expressed as a function call, an
intrinsic function is the method of choice for LLVM extension." (from
http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

At any rate, I apologize for my original hand-waviness; I am young and
inexperienced.

No, it just must not be optimised away. The CPU is still free to cache
it.

Joerg

The common use case I've seen for a black box like construct is when
writing microbenchmarks. In particular, you're generally looking for a way
to "sink" the result of a computation without having that sink outweigh the
cost of the thing you're trying to measure.

Common alternate approaches are to use a volatile store (so that it can't
be eliminated or sunk out of loops) or a call to an external function with
a cheap calling convention.

As an example:
int a = 5; // initialization is not visible to compiler

This intrinsic is designed to provide exactly this (assuming
`__builtin_blackbox` is also added to clang):

int a = __builtin_blackbox(5);

int b = 7;
void add_two_globals() {
  sink(a+b);

And:

__builtin_blackbox(a+b);

}

If what I'm looking into is the code generation around addition, this is a
very useful way of testing the entire compiler - frontend, middle end, and
backend.

I'll note that we use such a framework extensively.

What I'm not clear on is why this needs to be an intrinsic. Why does a
call to an external function or a volatile store not suffice?

Because Rust's `test::black_box` is generic, it can't be an external
function. Also, code using this will actually be executed, so it's also not
good enough to leave it undefined.

> > I don't see how this is any different from volatile markers on
> > loads/stores or memory barriers or several other optimizer blocking
> > devices. They generally end up crippling the optimizers without much
> > added benefit.
>
> Volatile must touch memory (right?). Memory is slow.

No, it just must not be optimised away. The CPU is still free to cache
it.

Hm, okay. Well, at any rate it isn't guaranteed; I will do some tests to
check anyway.

Joerg

Great!

You should then test that this happens, and additionally write a test that
can't be optimized away, since the above is apparently not a useful
microbenchmark for anything but the compiler :wink:

Seriously though, there are basically three cases (with a bit of handwaving)

1. You want to test that the compiler optimizes something in a certain
way. The above example, without anything else, you actually want to test
that the compiler optimizes this away completely.
This doesn't require anything except using something like FileCheck and
producing IR at the end of Rust's optimization pipeline.

2. You want to make the above code into a benchmark, and ensure the
compiler is required to keep the number and relative order of certain
operations.
Use volatile for this.

Volatile is not what you seem to think it is; you may be thinking of it in
terms of what people use it for in C/C++.
volatile in llvm has a well defined meaning:
http://llvm.org/docs/LangRef.html#volatile-memory-accesses

3. You want to get the compiler to only do certain optimizations to your
code.

Yes, you have to either write a test harness (even if that test harness is
"your normal compiler, with certain flags passed"), or use ours, for that
:wink:

It seems like you want #2, so you should use volatile.

But don't conflate #2 and #3.

As said:
If you want the compiler to only do certain things to your code, you should
tell it to only do those things by giving it a pass pipeline that only does
those things. Nothing else is going to solve this problem well.

If you want the compiler to do every optimization it knows to your code,
but want it to maintain the number and relative order of certain
operations, that's volatile.
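The volatile-based approach described here could be sketched in Rust roughly as follows (an illustrative sketch, not the actual `test` crate implementation; some benchmarking crates later used a variant of this):

```rust
use std::{mem, ptr};

// A volatile black_box: the volatile load must be assumed observable,
// so the value has to actually be computed and exist in memory here.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        // Bitwise-copy the value out through a volatile read.
        let ret = ptr::read_volatile(&dummy);
        // `ret` duplicates `dummy`; forget the original so its
        // destructor doesn't run twice.
        mem::forget(dummy);
        ret
    }
}

fn main() {
    // The fold can no longer be deleted as dead code.
    let n = black_box((0..1000u32).fold(0, |old, new| old ^ new));
    assert_eq!(n, 0);
}
```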

How would black_box be different from existing mechanism (inline asm, volatile, …)?
If the effect on the optimizer is not different then there is no reason to introduce a new intrinsic just for the sake of it. It has some cost: any optimization has to take this into account.

On this topic, I think Chandler’s talk at CppCon seems relevant: https://www.youtube.com/watch?v=nXaxk27zwlk

The doc says that if you need to extend LLVM, you should try an intrinsic before adding an instruction; it is the “need” part that is not clear here. The doc also states that an intrinsic is transparent to optimization passes, but that is not the case here, since you want to prevent optimizations from happening (and you haven’t really specified what an optimization may do around this intrinsic, because if you don’t teach the optimizer about it, it will treat it as an external function call).

I'm very unclear on why you think a generic black box intrinsic will
have any different performance impact :wink:

I'm also unclear on what the goal with this intrinsic is.
I understand the symptoms you are trying to solve - what exactly is the
disease.

IE you say "

I'd like to propose a new intrinsic for use in preventing optimizations
from deleting IR due to constant propagation, dead code elimination, etc."

But why are you trying to achieve this goal?

It's a cleaner design than current solutions (as far as I'm aware).

For what, exact, well defined goal?

Trying to make certain specific optimizations not work does not seem like
a goal unto itself.
It's a thing you are doing to achieve something else, right?
(Because if not, it has a very well defined and well supported solution -
set up a pass manager that runs the passes you want)

What is the something else?

IE what is the problem that led you to consider this solution.

I apologize if I'm not being clear enough. This contrived example

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        (0..1000).fold(0, |old, new| old ^ new);
    });
}

is completely optimized away. Granted, IRL production (ignoring the
question of why this code was ever used in production in the first place)
this optimization is desired, but here it leads to bogus measurements (ie
0ns per iteration). By using `test::black_box`, one would have

#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(1000);  // optional
        test::black_box((0..n).fold(0, |old, new| old ^ new));
    });
}

and the microbenchmark wouldn't have bogus 0ns measurements anymore.

I still don't understand what you are trying to test with this.

Are you trying to measure e.g. the performance of

xor %eax, %eax
1:
xor %esi, %eax
dec %esi
jnz 1b

?

are you trying to measure whether the compiler will vectorize this? Are you
trying to test how well the compiler will vectorize this? Are you trying to
measure the compiler's unrolling heuristics? Are you trying to see if the
(0..n).fold(...) machinery gets lowered to a loop? Are you trying to see if
the compiler will reduce it to (n & ((n&1)-1)) ^ ((n ^ (n >> 1))&1) or a
similar closed form expression? (I'm sure that's not the simplest one; just
one I cooked up)
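For what it's worth, that closed form can be checked against the straightforward fold; note it computes the XOR of the inclusive range 0..=n, while the benchmark folds over the exclusive range 0..n (a quick sketch, using wrapping arithmetic to sidestep unsigned underflow):

```rust
// XOR of 0..=n by the straightforward fold.
fn xor_fold(n: u32) -> u32 {
    (0..=n).fold(0, |old, new| old ^ new)
}

// Sean's closed form; `wrapping_sub` avoids the debug-mode underflow
// panic of `(n & 1) - 1` when n is even.
fn xor_closed(n: u32) -> u32 {
    (n & (n & 1).wrapping_sub(1)) ^ ((n ^ (n >> 1)) & 1)
}

fn main() {
    // The two agree for every n in a reasonable range.
    for n in 0..10_000 {
        assert_eq!(xor_fold(n), xor_closed(n));
    }
}
```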

I'm honestly curious.

-- Sean Silva