RFC: Killing undef and spreading poison

Hi,

Thanks everybody that showed up in our talk at the LLVM dev meeting and to
those that provided feedback so far.
The slides are already online:
http://llvm.org/devmtg/2016-11/Slides/Lopes-LongLivePoison.pdf

The main question that some people raised was whether we could have bitwise
poison instead of value-wise poison, since that semantics seems to be more
natural as values continue to be just a bag of bits.
During the talk I didn't have a good answer to this question. We've now
studied this option more thoroughly. Apologies for the delay, but I think
now we have a good handle on the subject.

Ok, so bitwise poison is not a well-defined concept in itself; we still to
define the semantics of the individual operations themselves. We studied two
different options:

1) a bit in the output is poison if flipping any set of poison bits in the
input may yield different values for the output bit.
For example (bitwise notation):
ppp * 000 == 000 (since flipping any of the poison bits cannot yield a
result other than zero)
00p + 00p == 0pp
ppp << 2 == p00
icmp ugt ppp, uint_max == p (special case)

2) (thanks to Sanjoy): for an operation R = A op B, bit R[i] is poison if
bit i of 'A op undef' may yield 0 and 1 (same for 'undef op B').
This one unfortunately breaks some algebraic rewrites like: x * -1 -> 0
- x

I've implemented both semantics in Alive (branches: newsema2 and
bitwise-poison2, respectively) and then we run Alive over a large set of
optimizations to see which were ok and which became wrong.

Many optimizations became wrong with either of the bitwise semantics. But
the important ones are:
- We keep the problem of not being able to increase number of register uses
in an expression. For example, these are wrong: x << 1 -> x + x and x *
(1 + y) -> x + x*y (FMA-like transformation).
- It becomes hard to make SROA correct and keep all the shift optimizations
we have; one of these would be wrong. For example, InstCombine does the
following optimization:
(%x ashr exact C1) shl C2 -> %x shl (C2-C1)
If we assume that shift outputs zero for the bits it introduces when
shifting, then these kind of shift optimizations are all wrong (we would
need those bits to be poison). On the other hand, SROA combines multiple
values with bitwise arithmetic and therefore it requires shift to leave
shifted bits as zero. This is a contradiction, so we can't have both
(unless we use a ton of freeze instructions).

I think these two problems with bitwise poison semantics are a deal breaker.
This kind of semantics opens more problems than it solves.
BTW, besides shift transformations, 'div exact' suffer from similar
problems. E.g., this is wrong: (%X udiv exact %Y) * %Y -> %X.

So, my suggestion is to go for value-wise poison, but maybe with a few
tweaks from what I've presented (based on the feedback we received):
- Have 2 load instructions: one that does a value-wise access (if any bit
is poison in memory, the whole value becomes poison), and another version
that freezes each bit individually (which is just syntactic sugar for a
freeze(load <n x i1>) followed by a bitcast). The latter load is very close
to the load instruction we have today.

Adding this extra instruction has some benefits: it simplifies IR
auto-upgrade, and frontends can move to the new instruction incrementally.
Also, it seems that C++ codegen could use the freezing load in a few places.
On the negative side is that the freezing load is harder to move around and
therefore we would probably need a pass to convert these freezing-loads into
valuewise-poison-loads whenever possible.

It's true that this proposal is also increasing the abstraction level of the
LLVM IR a little bit by explicitly saying that values are *not* a bag of
bits. Therefore, low-level bit manipulation has to be done carefully.
IMHO this slight change is ok, and lower-level optimization can be left to
SDAG/MI, where it's safe to do them and it's probably where they belong.
LLVM IR is typed, and values need to be treated as having a type in its own
right. This is probably more of a change in mindset rather than in the LLVM
code itself. At least I couldn't find anything concrete in LLVM that would
break.

I hope this convinces people that bitwise poison is not a good way to go. I
propose we go forward with the value-wise poison with the modification I've
described above.
Please let us know your thoughts and/or if you have questions. It would be
great to finally close all the issues related with undef and poison that we
have today :slight_smile:

Thanks,
Nuno

From: "Nuno Lopes" <nuno.lopes@ist.utl.pt>
To: llvm-dev@lists.llvm.org
Cc: "Sanjoy Das" <sanjoy@playingwithpointers.com>, "David Majnemer" <david.majnemer@gmail.com>, "John Regehr"
<regehr@cs.utah.edu>, "Chung-Kil Hur" <gil.hur@sf.snu.ac.kr>, "Juneyoung Lee" <juneyoung.lee@sf.snu.ac.kr>,
hfinkel@anl.gov, "Dan Gohman" <dan433584@gmail.com>, "Philip Reames" <listmail@philipreames.com>,
clattner@apple.com, "Chandler Carruth" <chandlerc@google.com>
Sent: Tuesday, December 6, 2016 12:38:26 PM
Subject: RE: RFC: Killing undef and spreading poison

Hi,

Thanks everybody that showed up in our talk at the LLVM dev meeting
and to
those that provided feedback so far.
The slides are already online:
http://llvm.org/devmtg/2016-11/Slides/Lopes-LongLivePoison.pdf

The main question that some people raised was whether we could have
bitwise
poison instead of value-wise poison, since that semantics seems to be
more
natural as values continue to be just a bag of bits.
During the talk I didn't have a good answer to this question. We've
now
studied this option more thoroughly. Apologies for the delay, but I
think
now we have a good handle on the subject.

Ok, so bitwise poison is not a well-defined concept in itself; we
still to
define the semantics of the individual operations themselves. We
studied two
different options:

1) a bit in the output is poison if flipping any set of poison bits
in the
input may yield different values for the output bit.

This is the definition I'd expect.

For example (bitwise notation):
ppp * 000 == 000 (since flipping any of the poison bits cannot yield
a
result other than zero)
00p + 00p == 0pp
ppp << 2 == p00
icmp ugt ppp, uint_max == p (special case)

2) (thanks to Sanjoy): for an operation R = A op B, bit R[i] is
poison if
bit i of 'A op undef' may yield 0 and 1 (same for 'undef op B').
This one unfortunately breaks some algebraic rewrites like: x * -1
  -> 0
- x

I've implemented both semantics in Alive (branches: newsema2 and
bitwise-poison2, respectively) and then we run Alive over a large set
of
optimizations to see which were ok and which became wrong.

Many optimizations became wrong with either of the bitwise semantics.

To be clear, "wrong" here means that the value resulting from the transformation might have more poison bits for some inputs than the original value, correct?

But
the important ones are:
- We keep the problem of not being able to increase number of
register uses
in an expression. For example, these are wrong: x << 1 -> x + x
and x *
(1 + y) -> x + x*y (FMA-like transformation).
- It becomes hard to make SROA correct and keep all the shift
optimizations
we have; one of these would be wrong. For example, InstCombine does
the
following optimization:
(%x ashr exact C1) shl C2 -> %x shl (C2-C1)
If we assume that shift outputs zero for the bits it introduces when
shifting, then these kind of shift optimizations are all wrong (we
would
need those bits to be poison). On the other hand, SROA combines
multiple
values with bitwise arithmetic and therefore it requires shift to
leave
shifted bits as zero. This is a contradiction, so we can't have both
(unless we use a ton of freeze instructions).

I think these two problems with bitwise poison semantics are a deal
breaker.
This kind of semantics opens more problems than it solves.
BTW, besides shift transformations, 'div exact' suffer from similar
problems. E.g., this is wrong: (%X udiv exact %Y) * %Y -> %X.

So, my suggestion is to go for value-wise poison, but maybe with a
few
tweaks from what I've presented (based on the feedback we received):
- Have 2 load instructions: one that does a value-wise access (if
any bit
is poison in memory, the whole value becomes poison), and another
version
that freezes each bit individually (which is just syntactic sugar for
a
freeze(load <n x i1>) followed by a bitcast). The latter load is very
close
to the load instruction we have today.

Adding this extra instruction has some benefits: it simplifies IR
auto-upgrade, and frontends can move to the new instruction
incrementally.
Also, it seems that C++ codegen could use the freezing load in a few
places.

Can we be specific on what "few places" we're talking about? I assume this means at least:

1. Accesses to bit fields (or other places where Clang generates loads larger than the underlying data type)
2. memcpy-like operations
3. Accesses though character pointers? (because character pointers can be used to access data of any type)

On the negative side is that the freezing load is harder to move
around and

Why exactly are these harder to move around?

Thanks again,
Hal

1) a bit in the output is poison if flipping any set of poison bits
in the
input may yield different values for the output bit.

This is the definition I'd expect.

For example (bitwise notation):
ppp * 000 == 000 (since flipping any of the poison bits cannot yield
a
result other than zero)
00p + 00p == 0pp
ppp << 2 == p00
icmp ugt ppp, uint_max == p (special case)

2) (thanks to Sanjoy): for an operation R = A op B, bit R[i] is
poison if
bit i of 'A op undef' may yield 0 and 1 (same for 'undef op B').
This one unfortunately breaks some algebraic rewrites like: x * -1
  -> 0
- x

I've implemented both semantics in Alive (branches: newsema2 and
bitwise-poison2, respectively) and then we run Alive over a large set
of
optimizations to see which were ok and which became wrong.

Many optimizations became wrong with either of the bitwise semantics.

To be clear, "wrong" here means that the value resulting from the transformation might have more poison bits for some inputs than the original value, correct?

The vast majority of the examples were of that case, yes. But we also have a few cases where the transformed expression computes a different value (I don't have a record of these handy, sorry).

So, my suggestion is to go for value-wise poison, but maybe with a
few
tweaks from what I've presented (based on the feedback we received):
- Have 2 load instructions: one that does a value-wise access (if
any bit
is poison in memory, the whole value becomes poison), and another
version
that freezes each bit individually (which is just syntactic sugar for
a
freeze(load <n x i1>) followed by a bitcast). The latter load is very
close
to the load instruction we have today.

Adding this extra instruction has some benefits: it simplifies IR
auto-upgrade, and frontends can move to the new instruction
incrementally.
Also, it seems that C++ codegen could use the freezing load in a few
places.

Can we be specific on what "few places" we're talking about? I assume this means at least:

1. Accesses to bit fields (or other places where Clang generates loads larger than the underlying data type)
2. memcpy-like operations
3. Accesses though character pointers? (because character pointers can be used to access data of any type)

1. Bit fields definitely. Sometimes clang needs to widen loads because of the ABI, but these can probably be changed to vector loads instead.
2. Clang introduces memcpys to copy objects (which may have uninitialized padding). My suggestion (contrary to my previous opinion) is to make memcpy a bit-wise copy to simplify things.
3. For char ptrs, not sure that's needed. You would be loading an i8, which is the type you are asking for, so it doesn't seem necessary.
4. I'm not a clang expert by any means, but I briefly spoke with Richard Smith during the dev meeting and he told me that there are some cases involving initialization of unions in C++ that may also need this freezing load. I don't know the exact details.

On the negative side is that the freezing load is harder to move
around and

Why exactly are these harder to move around?

Because they have an implicit freeze, which is also non-trivial to move. For example, these freezing loads cannot easily be sank into loops, like:
%v = load %p
while (...) { use(%v) }
  =>
while (...) { %v = load %p; use(%v) }

The load inside the loop may return a different value for each uninitialized/poison bit in each loop iteration, while in the original code all iterations would see the same value for %v.

So this load should be reserved for the cases when it's actually needed only. For example, in clang, all loads from the program itself should use the value-wise-poison-load, not this one.
This load is similar as if we were to make undef yield the same value for all uses. Since our load today returns undef for uninitialized memory, the current load would become pretty much equivalent to the load we are proposing.
(we already established that having undef is not a good idea; this is just an analogy that will hopefully give some intuition and help understanding the concept of this new load)

Nuno

From: "Nuno Lopes" <nuno.lopes@ist.utl.pt>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Sanjoy Das" <sanjoy@playingwithpointers.com>, "David Majnemer" <david.majnemer@gmail.com>, "John Regehr"
<regehr@cs.utah.edu>, "Chung-Kil Hur" <gil.hur@sf.snu.ac.kr>, "Juneyoung Lee" <juneyoung.lee@sf.snu.ac.kr>, "Dan
Gohman" <dan433584@gmail.com>, "Philip Reames" <listmail@philipreames.com>, clattner@apple.com, "Chandler Carruth"
<chandlerc@google.com>, llvm-dev@lists.llvm.org
Sent: Wednesday, December 7, 2016 5:27:31 PM
Subject: Re: Killing undef and spreading poison

>> 1) a bit in the output is poison if flipping any set of poison
>> bits
>> in the
>> input may yield different values for the output bit.
>
> This is the definition I'd expect.
>
>> For example (bitwise notation):
>> ppp * 000 == 000 (since flipping any of the poison bits cannot
>> yield
>> a
>> result other than zero)
>> 00p + 00p == 0pp
>> ppp << 2 == p00
>> icmp ugt ppp, uint_max == p (special case)
>>
>> 2) (thanks to Sanjoy): for an operation R = A op B, bit R[i] is
>> poison if
>> bit i of 'A op undef' may yield 0 and 1 (same for 'undef op B').
>> This one unfortunately breaks some algebraic rewrites like: x * -1
>> -> 0
>> - x
>>
>>
>> I've implemented both semantics in Alive (branches: newsema2 and
>> bitwise-poison2, respectively) and then we run Alive over a large
>> set
>> of
>> optimizations to see which were ok and which became wrong.
>>
>> Many optimizations became wrong with either of the bitwise
>> semantics.
>
> To be clear, "wrong" here means that the value resulting from the
> transformation might have more poison bits for some inputs than the
> original value, correct?

The vast majority of the examples were of that case, yes. But we
also have
a few cases where the transformed expression computes a different
value (I
don't have a record of these handy, sorry).

Those are just bugs, right?

>> So, my suggestion is to go for value-wise poison, but maybe with a
>> few
>> tweaks from what I've presented (based on the feedback we
>> received):
>> - Have 2 load instructions: one that does a value-wise access (if
>> any bit
>> is poison in memory, the whole value becomes poison), and another
>> version
>> that freezes each bit individually (which is just syntactic sugar
>> for
>> a
>> freeze(load <n x i1>) followed by a bitcast). The latter load is
>> very
>> close
>> to the load instruction we have today.
>>
>> Adding this extra instruction has some benefits: it simplifies IR
>> auto-upgrade, and frontends can move to the new instruction
>> incrementally.
>> Also, it seems that C++ codegen could use the freezing load in a
>> few
>> places.
>
> Can we be specific on what "few places" we're talking about? I
> assume this
> means at least:
>
> 1. Accesses to bit fields (or other places where Clang generates
> loads
> larger than the underlying data type)
> 2. memcpy-like operations
> 3. Accesses though character pointers? (because character pointers
> can be
> used to access data of any type)

1. Bit fields definitely. Sometimes clang needs to widen loads
because of
the ABI, but these can probably be changed to vector loads instead.
2. Clang introduces memcpys to copy objects (which may have
uninitialized
padding). My suggestion (contrary to my previous opinion) is to make
memcpy
a bit-wise copy to simplify things.

And then how exactly do we turn small memcpys into "regular" loads/stores?

Thanks again,
Hal

>> I've implemented both semantics in Alive (branches: newsema2 and
>> bitwise-poison2, respectively) and then we run Alive over a large
>> set
>> of
>> optimizations to see which were ok and which became wrong.
>>
>> Many optimizations became wrong with either of the bitwise
>> semantics.
>
> To be clear, "wrong" here means that the value resulting from the
> transformation might have more poison bits for some inputs than the
> original value, correct?

The vast majority of the examples were of that case, yes. But we
also have
a few cases where the transformed expression computes a different
value (I
don't have a record of these handy, sorry).

Those are just bugs, right?

Well, increasing the number of poison bits would also be considered a bug.
I guess the main issue is that some InstCombine transformations that we consider correct today are not under the bitwise poison semantics. Some of these transformations start producing wrong values because poison is not propagating enough for us to be able to discard the inputs that trigger different output values as "uninteresting". Bitwise poison semantics has too many opportunities to lose poison (e.g., shifting out poison bits, no carry in arithmetic instructions, and/or with zero/one bits, etc).

>> Adding this extra instruction has some benefits: it simplifies IR
>> auto-upgrade, and frontends can move to the new instruction
>> incrementally.
>> Also, it seems that C++ codegen could use the freezing load in a
>> few
>> places.
>
> Can we be specific on what "few places" we're talking about? I
> assume this
> means at least:
>
> 1. Accesses to bit fields (or other places where Clang generates
> loads
> larger than the underlying data type)
> 2. memcpy-like operations
> 3. Accesses though character pointers? (because character pointers
> can be
> used to access data of any type)

1. Bit fields definitely. Sometimes clang needs to widen loads
because of
the ABI, but these can probably be changed to vector loads instead.
2. Clang introduces memcpys to copy objects (which may have
uninitialized
padding). My suggestion (contrary to my previous opinion) is to make
memcpy
a bit-wise copy to simplify things.

And then how exactly do we turn small memcpys into "regular" loads/stores?

Bit vector load/stores (i.e., <n x i1>). I was really happy to see that the codegen of i1 vectors has gotten some attention recently btw!

Nuno

Related work:

Hope everyone had a great Christmas!

John

This is pretty interesting and I had missed the earlier discussion.

http://lists.llvm.org/pipermail/llvm-dev/2016-December/108588.html

John