Possible SelectionDAG Bug

In the continuing quest to try to track down problems we're seeing in
SelectionDAG, I added the following assert
toSelectionDAG::ReplaceAllUsesOfValuesWith:

  // Read up all the uses and make records of them. This helps
  // processing new uses that are introduced during the
  // replacement process.
  SmallVector<UseMemo, 4> Uses;
  for (unsigned i = 0; i != Num; ++i) {
    unsigned FromResNo = From[i].getResNo();
    SDNode *FromNode = From[i].getNode();

#ifndef NDEBUG
    assert(FromNode->getOpcode() != ISD::DELETED_NODE && "FromNode deleted!");
#endif

This triggers all over the place in the testbase. Is it expected that we
could get a deleted node here? The following code appears to assume
FromNode is a valid SDNode.

                                                -Dave

Here's a patch to add more of these deleted node asserts. They fire
tons of times in the testbase.

This concerns me greatly.

Are all of these asserts valid? If not, we should document why with
comments in the source.

                                             -Dave

sdassert.patch (5.39 KB)

Ping? Just want to make sure this didn't get missed somehow. I'm
surprised to see no discussion.

                                            -Dave

I've now looked at your latest patch. In summary, it does expose a
subtle problem. I haven't seen anything that here would lead to
observable misbehavior yet though.

X86GenDAGISel.inc has code like this:

   SDValue N1 = N->getOperand(1);
   ...
   SDNode *ResNode = CurDAG->SelectNodeTo(N, ...);
   ...
   ReplaceUses(SDValue(N1.getNode(), 1), SDValue(ResNode, 2));

If N was the only user of N1, and N1 isn't in the new operand list, then
the SelectNodeTo call will make N1 dead. SelectNodeTo will automatically
delete nodes that are made dead by the transformation. This means that
the "from" node in the subsequent ReplaceUses call will be a deleted node
in that case.

This doesn't break in practice because deleted node aren't actually
deallocated, they're just put on a linked list of nodes ready to be
reused next time a new node is created, and no new nodes are created
between the SelectNodeTo call and the ReplaceUses call.

Another observation here is that the ReplaceUses call in such cases
doesn't do anything, because the From node has no uses.

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Dan

I've now looked at your latest patch. In summary, it does expose a
subtle problem. I haven't seen anything that here would lead to
observable misbehavior yet though.

Well, I'm definitely observing misbehavior. I know it has something to do
with local changes here but I haven't isolated it yet.

X86GenDAGISel.inc has code like this:

   SDValue N1 = N->getOperand(1);
   ...
   SDNode *ResNode = CurDAG->SelectNodeTo(N, ...);
   ...
   ReplaceUses(SDValue(N1.getNode(), 1), SDValue(ResNode, 2));

If N was the only user of N1, and N1 isn't in the new operand list, then
the SelectNodeTo call will make N1 dead. SelectNodeTo will automatically
delete nodes that are made dead by the transformation. This means that
the "from" node in the subsequent ReplaceUses call will be a deleted
node
in that case.

Ok, so that should be easy to fix.

This doesn't break in practice because deleted node aren't actually
deallocated, they're just put on a linked list of nodes ready to be
reused next time a new node is created, and no new nodes are created
between the SelectNodeTo call and the ReplaceUses call.

Right.

Another observation here is that the ReplaceUses call in such cases
doesn't do anything, because the From node has no uses.

Right.

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Why not just check the same thing the added asserts check?

What I'm seeing is a problem in ReplaceAllUsesOf itself. It recurses
down and eventually replaces the node under the iterator in this use
loop:

  SDNode::use_iterator UI = From.getNode()->use_begin(),
                       UE = From.getNode()->use_end();
  while (UI != UE) {
    SDNode *User = *UI;
    bool UserRemovedFromCSEMaps = false;

UI goes bad and we blow up after returning from a deeply recursed call.

It's simply not safe to iterate over a set that may change. Unfortunately,
any of the nodes under the iterators may change so I don't see an easy
way to fix this.

                                               -Dave

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Why not just check the same thing the added asserts check?

You mean ->getOpcode() == ISD::DELETED_NODE? That's not fundamentally
any better, because if your purpose is to make this code work even
if nodes are actually deleted, that would still be a use of free'd
memory.

What I'm seeing is a problem in ReplaceAllUsesOf itself. It recurses
down and eventually replaces the node under the iterator in this use
loop:

SDNode::use_iterator UI = From.getNode()->use_begin(),
                      UE = From.getNode()->use_end();
while (UI != UE) {
   SDNode *User = *UI;
   bool UserRemovedFromCSEMaps = false;

UI goes bad and we blow up after returning from a deeply recursed call.

It's simply not safe to iterate over a set that may change. Unfortunately,
any of the nodes under the iterators may change so I don't see an easy
way to fix this.

The thing it's iterating over is a linked list. And the use_end() iterator
is essentially a null pointer.

Dan

>> Perhaps this can be fixed by making the code skip the ReplaceUses
>> call in the case where there are no uses to replace. That's not trivial
>> to detect though.
>
> Why not just check the same thing the added asserts check?

You mean ->getOpcode() == ISD::DELETED_NODE? That's not fundamentally
any better, because if your purpose is to make this code work even
if nodes are actually deleted, that would still be a use of free'd
memory.

Wait, I thought memory wasn't actually freed, just returned to a free list.

> UI goes bad and we blow up after returning from a deeply recursed call.
>
> It's simply not safe to iterate over a set that may change.
> Unfortunately, any of the nodes under the iterators may change so I don't
> see an easy way to fix this.

The thing it's iterating over is a linked list. And the use_end() iterator
is essentially a null pointer.

No, what I mean is the thing under UI at the point of call to
AddModifiedNodeToCSEMaps gets deleted. So UI is invalid and when
we loop back around and check it against UE we blow up with a
singular iterator error.

I can add code to save a few values of UI, find one that "works"
after AddModifiedNodeToCSEMaps and get llc to finish. But that's
a horrible hack only meant to prove that this is actually the problem.

We're essentially doing this:

std::list<...>::iterator i = ...
for (... i != list.end(); ++i) {
  foo(i);
}

foo(iterator i) {
  list.erase(i);
}

                                          -Dave

UI is incremented before AddModifiedNodeToCSEMaps is called, so
I'm still not seeing what you're describing here.

Dan

It's not the increment that trips, it's the loop top compare to UE. It
doesn't matter where UI points at the call to AddModifiedNodeToCSEMaps.
The fact that AddModifiedNodeToCSEMaps can recursively call
ReplaceAllUsesOf means that we can potentially delete the node
out from under UI.

                                           -Dave

Ok, I think I've finally managed to draw on my whiteboard a theoretical
situation which could have a problem like this.

The attached patch should theoretically fix this bug, though I have no
way to confirm this right now. Does it fix the bug you're seeing?

Dan

rauw.patch (3.76 KB)

Ok, I think I've finally managed to draw on my whiteboard a theoretical
situation which could have a problem like this.

Do you have a way to express it over the list? I'd like to learn a little
more about SelectionDAG and how this could happen.

The attached patch should theoretically fix this bug, though I have no
way to confirm this right now. Does it fix the bug you're seeing?

Can you send it again? I get a zero-length file. Same with the attached
.txt file.

                                                   -Dave

Never mind. Apparently Cray mangles it. My home e-mail passed it through
just fine.

                                           -Dave

The attached patch should theoretically fix this bug, though I have no

Clever. :slight_smile:

way to confirm this right now. Does it fix the bug you're seeing?

Yep, it fixed it.

Now we should get a testcase for this. Can you describe the (not so :slight_smile: )
theoretical way this could be triggered? Then we can synthesize a testcase.

                                        -Dave

Hmm...curiously, not all. More tomorrow.

                                            -Dave

Ah, missed a spot in 2.5, which has a few more RAUW implementations.
I think we're good.

Thanks!

                                             -Dave

way to confirm this right now. Does it fix the bug you're seeing?

Yep, it fixed it.

Hmm...curiously, not all. More tomorrow.

Ah, missed a spot in 2.5, which has a few more RAUW implementations.
I think we're good.

Ok. Are you able to produce a testcase? Does it depend on custom changes?
Where does the failure happen (legalize, dagcombine, selection, etc.)?

The interesting case happens when the "++UI;" code in the patch gets
executed, so you could put a breakpoint there and look at the graph at
that point, or insert an abort and use bugpoint.

Anyway, here's the theoretical situation I imagined:

A is a user of T. B, C, and D are all users of F. C is also a user of B,
and D is also a user of A. A and B have the same opcode. C and D have the
same opcode.

  F T
// \ |

> >
B A

\ / /

\ C /
\ |
  \ |
   \|
    D

RAUW(F, T) is called. B is first on F's use list. C is the next use,
so the use iterator is incremented to point to C. B's use is updated.

  F T
// /|

/ |
B A

\ / /

\ C /
\ |
  \ |
   \|
    D

B would now be identical to A, so RAUW(B, A) gets called and B is
deleted. C was a user of B so it now becomes a user of A.

  F T
// |

   >
__A

\ / /

\ C /
\ |
  \ |
   \|
    D

That would make C identical to D, so RAUW(C, D) gets called and C is
deleted.

  F T
/ |

    >
    A
   /

\ /
\ |
  \ |
   \|
    D

CSE and recursive RAUW are now done, so control transfers back to
the original RAUW call. C was where the the first RAUW call's
iterator was pointing, and it's now deleted, so that iterator is
now invalid.

Dan

> Ah, missed a spot in 2.5, which has a few more RAUW implementations.
> I think we're good.

Ok. Are you able to produce a testcase? Does it depend on custom changes?

Apparently it does, because I have not been able to get the reduced testcase
I have to fail in any official LLVM release. I've tried disabling various
customizations here but have so far been unsuccessful. I could certainly
contribute the reduced testcase if people think it would be useful. It did
after all find a problem.

Where does the failure happen (legalize, dagcombine, selection, etc.)?

DAGCombine.

The interesting case happens when the "++UI;" code in the patch gets
executed, so you could put a breakpoint there and look at the graph at
that point, or insert an abort and use bugpoint.

Right. I used bugpoint to reduce the testcase, but this was with our modified
LLVM. I can submit that even though it never failed with stock LLVM code.

Anyway, here's the theoretical situation I imagined:

CSE and recursive RAUW are now done, so control transfers back to
the original RAUW call. C was where the the first RAUW call's
iterator was pointing, and it's now deleted, so that iterator is
now invalid.

That makes sense. In my particular situation it was a combinations of loads,
stores and chains. Would it help to dump and/or graphviz the DAG before the
offending combine? Is there some way we could construct a testcase given
a SelectionDAG?

                                                   -Dave