Possible SelectionDAG Bug

David_A_Greene · February 26, 2010, 3:55pm

In the continuing quest to try to track down problems we're seeing in
SelectionDAG, I added the following assert
toSelectionDAG::ReplaceAllUsesOfValuesWith:

  // Read up all the uses and make records of them. This helps
  // processing new uses that are introduced during the
  // replacement process.
  SmallVector<UseMemo, 4> Uses;
  for (unsigned i = 0; i != Num; ++i) {
    unsigned FromResNo = From[i].getResNo();
    SDNode *FromNode = From[i].getNode();

#ifndef NDEBUG
assert(FromNode->getOpcode() != ISD::DELETED_NODE && "FromNode deleted!");
#endif

This triggers all over the place in the testbase. Is it expected that we
could get a deleted node here? The following code appears to assume
FromNode is a valid SDNode.

-Dave

David_A_Greene · February 26, 2010, 4:34pm

Here's a patch to add more of these deleted node asserts. They fire
tons of times in the testbase.

This concerns me greatly.

Are all of these asserts valid? If not, we should document why with
comments in the source.

-Dave

sdassert.patch (5.39 KB)

David_A_Greene · February 26, 2010, 10:07pm

Ping? Just want to make sure this didn't get missed somehow. I'm
surprised to see no discussion.

-Dave

Dan_Gohman3 · February 27, 2010, 1:09am

I've now looked at your latest patch. In summary, it does expose a
subtle problem. I haven't seen anything that here would lead to
observable misbehavior yet though.

X86GenDAGISel.inc has code like this:

   SDValue N1 = N->getOperand(1);
   ...
   SDNode *ResNode = CurDAG->SelectNodeTo(N, ...);
   ...
   ReplaceUses(SDValue(N1.getNode(), 1), SDValue(ResNode, 2));

If N was the only user of N1, and N1 isn't in the new operand list, then
the SelectNodeTo call will make N1 dead. SelectNodeTo will automatically
delete nodes that are made dead by the transformation. This means that
the "from" node in the subsequent ReplaceUses call will be a deleted node
in that case.

This doesn't break in practice because deleted node aren't actually
deallocated, they're just put on a linked list of nodes ready to be
reused next time a new node is created, and no new nodes are created
between the SelectNodeTo call and the ReplaceUses call.

Another observation here is that the ReplaceUses call in such cases
doesn't do anything, because the From node has no uses.

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Dan

David_A_Greene · March 1, 2010, 3:26pm

I've now looked at your latest patch. In summary, it does expose a
subtle problem. I haven't seen anything that here would lead to
observable misbehavior yet though.

Well, I'm definitely observing misbehavior. I know it has something to do
with local changes here but I haven't isolated it yet.

X86GenDAGISel.inc has code like this:

   SDValue N1 = N->getOperand(1);
   ...
   SDNode *ResNode = CurDAG->SelectNodeTo(N, ...);
   ...
   ReplaceUses(SDValue(N1.getNode(), 1), SDValue(ResNode, 2));

If N was the only user of N1, and N1 isn't in the new operand list, then
the SelectNodeTo call will make N1 dead. SelectNodeTo will automatically
delete nodes that are made dead by the transformation. This means that
the "from" node in the subsequent ReplaceUses call will be a deleted
node
in that case.

Ok, so that should be easy to fix.

This doesn't break in practice because deleted node aren't actually
deallocated, they're just put on a linked list of nodes ready to be
reused next time a new node is created, and no new nodes are created
between the SelectNodeTo call and the ReplaceUses call.

Right.

Another observation here is that the ReplaceUses call in such cases
doesn't do anything, because the From node has no uses.

Right.

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Why not just check the same thing the added asserts check?

What I'm seeing is a problem in ReplaceAllUsesOf itself. It recurses
down and eventually replaces the node under the iterator in this use
loop:

  SDNode::use_iterator UI = From.getNode()->use_begin(),
                       UE = From.getNode()->use_end();
  while (UI != UE) {
    SDNode *User = *UI;
    bool UserRemovedFromCSEMaps = false;

UI goes bad and we blow up after returning from a deeply recursed call.

It's simply not safe to iterate over a set that may change. Unfortunately,
any of the nodes under the iterators may change so I don't see an easy
way to fix this.

-Dave

Dan_Gohman3 · March 1, 2010, 7:41pm

Perhaps this can be fixed by making the code skip the ReplaceUses
call in the case where there are no uses to replace. That's not trivial
to detect though.

Why not just check the same thing the added asserts check?

You mean ->getOpcode() == ISD::DELETED_NODE? That's not fundamentally
any better, because if your purpose is to make this code work even
if nodes are actually deleted, that would still be a use of free'd
memory.

What I'm seeing is a problem in ReplaceAllUsesOf itself. It recurses
down and eventually replaces the node under the iterator in this use
loop:

SDNode::use_iterator UI = From.getNode()->use_begin(),
                      UE = From.getNode()->use_end();
while (UI != UE) {
   SDNode *User = *UI;
   bool UserRemovedFromCSEMaps = false;

UI goes bad and we blow up after returning from a deeply recursed call.

It's simply not safe to iterate over a set that may change. Unfortunately,
any of the nodes under the iterators may change so I don't see an easy
way to fix this.

The thing it's iterating over is a linked list. And the use_end() iterator
is essentially a null pointer.

Dan

David_A_Greene · March 1, 2010, 9:59pm

>> Perhaps this can be fixed by making the code skip the ReplaceUses
>> call in the case where there are no uses to replace. That's not trivial
>> to detect though.
>
> Why not just check the same thing the added asserts check?

You mean ->getOpcode() == ISD::DELETED_NODE? That's not fundamentally
any better, because if your purpose is to make this code work even
if nodes are actually deleted, that would still be a use of free'd
memory.

Wait, I thought memory wasn't actually freed, just returned to a free list.

> UI goes bad and we blow up after returning from a deeply recursed call.
>
> It's simply not safe to iterate over a set that may change.
> Unfortunately, any of the nodes under the iterators may change so I don't
> see an easy way to fix this.

The thing it's iterating over is a linked list. And the use_end() iterator
is essentially a null pointer.

No, what I mean is the thing under UI at the point of call to
AddModifiedNodeToCSEMaps gets deleted. So UI is invalid and when
we loop back around and check it against UE we blow up with a
singular iterator error.

I can add code to save a few values of UI, find one that "works"
after AddModifiedNodeToCSEMaps and get llc to finish. But that's
a horrible hack only meant to prove that this is actually the problem.

We're essentially doing this:

std::list<...>::iterator i = ...
for (... i != list.end(); ++i) {
foo(i);
}

foo(iterator i) {
list.erase(i);
}

-Dave

Dan_Gohman3 · March 2, 2010, 2:32am

UI is incremented before AddModifiedNodeToCSEMaps is called, so
I'm still not seeing what you're describing here.

Dan

David_A_Greene · March 2, 2010, 4:57pm

It's not the increment that trips, it's the loop top compare to UE. It
doesn't matter where UI points at the call to AddModifiedNodeToCSEMaps.
The fact that AddModifiedNodeToCSEMaps can recursively call
ReplaceAllUsesOf means that we can potentially delete the node
out from under UI.

-Dave

Dan_Gohman3 · March 2, 2010, 10:13pm

Ok, I think I've finally managed to draw on my whiteboard a theoretical
situation which could have a problem like this.

The attached patch should theoretically fix this bug, though I have no
way to confirm this right now. Does it fix the bug you're seeing?

Dan

rauw.patch (3.76 KB)

David_A_Greene · March 2, 2010, 11:23pm

Ok, I think I've finally managed to draw on my whiteboard a theoretical
situation which could have a problem like this.

Do you have a way to express it over the list? I'd like to learn a little
more about SelectionDAG and how this could happen.

The attached patch should theoretically fix this bug, though I have no
way to confirm this right now. Does it fix the bug you're seeing?

Can you send it again? I get a zero-length file. Same with the attached
.txt file.

-Dave

David_A_Greene · March 2, 2010, 11:26pm

Never mind. Apparently Cray mangles it. My home e-mail passed it through
just fine.

-Dave

David_A_Greene · March 2, 2010, 11:58pm

The attached patch should theoretically fix this bug, though I have no

Clever.

way to confirm this right now. Does it fix the bug you're seeing?

Yep, it fixed it.

Now we should get a testcase for this. Can you describe the (not so )
theoretical way this could be triggered? Then we can synthesize a testcase.

-Dave

David_A_Greene · March 3, 2010, 12:24am

Hmm...curiously, not all. More tomorrow.

-Dave

David_A_Greene · March 3, 2010, 3:25pm

Ah, missed a spot in 2.5, which has a few more RAUW implementations.
I think we're good.

Thanks!

-Dave

Dan_Gohman3 · March 3, 2010, 7:26pm

way to confirm this right now. Does it fix the bug you're seeing?

Yep, it fixed it.

Hmm...curiously, not all. More tomorrow.

Ah, missed a spot in 2.5, which has a few more RAUW implementations.
I think we're good.

Ok. Are you able to produce a testcase? Does it depend on custom changes?
Where does the failure happen (legalize, dagcombine, selection, etc.)?

The interesting case happens when the "++UI;" code in the patch gets
executed, so you could put a breakpoint there and look at the graph at
that point, or insert an abort and use bugpoint.

Anyway, here's the theoretical situation I imagined:

A is a user of T. B, C, and D are all users of F. C is also a user of B,
and D is also a user of A. A and B have the same opcode. C and D have the
same opcode.

F T
// \ |

> >
B A

\ / /

\ C /
\ |
  \ |
   \|
    D

RAUW(F, T) is called. B is first on F's use list. C is the next use,
so the use iterator is incremented to point to C. B's use is updated.

F T
// /|

/ |
B A

\ / /

\ C /
\ |
  \ |
   \|
    D

B would now be identical to A, so RAUW(B, A) gets called and B is
deleted. C was a user of B so it now becomes a user of A.

F T
// |

>
__A

\ / /

\ C /
\ |
  \ |
   \|
    D

That would make C identical to D, so RAUW(C, D) gets called and C is
deleted.

F T
/ |

    >
    A
   /

\ /
\ |
  \ |
   \|
    D

CSE and recursive RAUW are now done, so control transfers back to
the original RAUW call. C was where the the first RAUW call's
iterator was pointing, and it's now deleted, so that iterator is
now invalid.

Dan

David_A_Greene · March 3, 2010, 8:15pm

> Ah, missed a spot in 2.5, which has a few more RAUW implementations.
> I think we're good.

Ok. Are you able to produce a testcase? Does it depend on custom changes?

Apparently it does, because I have not been able to get the reduced testcase
I have to fail in any official LLVM release. I've tried disabling various
customizations here but have so far been unsuccessful. I could certainly
contribute the reduced testcase if people think it would be useful. It did
after all find a problem.

Where does the failure happen (legalize, dagcombine, selection, etc.)?

DAGCombine.

The interesting case happens when the "++UI;" code in the patch gets
executed, so you could put a breakpoint there and look at the graph at
that point, or insert an abort and use bugpoint.

Right. I used bugpoint to reduce the testcase, but this was with our modified
LLVM. I can submit that even though it never failed with stock LLVM code.

Anyway, here's the theoretical situation I imagined:

CSE and recursive RAUW are now done, so control transfers back to
the original RAUW call. C was where the the first RAUW call's
iterator was pointing, and it's now deleted, so that iterator is
now invalid.

That makes sense. In my particular situation it was a combinations of loads,
stores and chains. Would it help to dump and/or graphviz the DAG before the
offending combine? Is there some way we could construct a testcase given
a SelectionDAG?

-Dave

Topic		Replies	Views
SelectionDAGISel::Select's API considered harmful LLVM Dev List Archives	4	83	May 23, 2016
[PATCH] Debug SelectionDAG & SDUse LLVM Dev List Archives	0	72	February 25, 2010
Question about changes to 'SelectionDAGISel.h' LLVM Dev List Archives	5	68	June 29, 2016
We have a pervasive problem in the SelectionDAG: the CSEMap. LLVM Dev List Archives	1	71	August 2, 2014
mysterious assert in SelectionDAGLegalize::LegalizeOp() LLVM Dev List Archives	0	69	October 15, 2015

Possible SelectionDAG Bug

Related Topics