Wrong code bug after GVN/PRE?

Hi,

I've stumbled upon a case where I think gvn does a bad (wrong) optimization. It's a bit messy to debug though so I'm not sure if I should just write a PR about it a let someone who knows the code look at it instead.

Anyway, for the bug to trigger I need to run the following passes in the same opt invocation:
  -sroa -instcombine -simplifycfg -instcombine -gvn

The problem seems to be that GVN::PerformLoadPRE triggers and I see a

GVN REMOVING PRE LOAD: %_tmp79 = load i16, i16* %_tmp78, align 2

printout.

If I instead first run

  -sroa -instcombine -simplifycfg -instcombine

and then

  -gvn

I don't get the

GVN REMOVING PRE LOAD

printout, and the resulting code looks ok to me.

Is this expected? Should the output differ in these two cases?

The input to gvn looks the same when running all passes and just gvn, but I see a difference in how GVN::processNonLocalLoad(LoadInst *LI) behaves:

I get different results from

   // Step 1: Find the non-local dependencies of the load.
   LoadDepVect Deps;
   MD->getNonLocalPointerDependency(LI, Deps);

So we get different results from MemoryDependenceResults when invoking gvn alone or after a bunch of other passes.

Is this expected or not?

I tried to dig into MemoryDependenceResults::getNonLocalPointerDependency but I have to say I'm quite lost.

I ran some git bisect and as far as I can tell it starts going wrong with commit

  [SimplifyCFG] Change the algorithm in SinkThenElseCodeToEnd

2016-09-01, but that commit doesn't change GVN/MemoryDependenceResults so I suppose it just changes the input to GVN so the problem suddenly surfaces.

Finally, the problem that I see is:

In the input we have something like

  for (int step1 = 0; step1 < LOOP_AMOUNT; step1++)
  {
    lb[step1] = step1 + 7;
    ub[step1] = (step1 == 0 ? 10: ub[step1-1]*10);

    switch(lb[step1]) {
    case 7: CVAL_VERIFY(step1 == 0);
      CVAL_VERIFY(ub[step1] == 10);
                         __label(511);
      break;
    case 8: CVAL_VERIFY(step1 == 1);
      CVAL_VERIFY(ub[step1] == 100);
                         __label(512);
      break;
    case 9: CVAL_VERIFY(step1 == 2);
      CVAL_VERIFY(ub[step1] == 1000);
                         __label(513);
      break;
    default:;
    }

                 [...]
  }

  int i, j;
  for ( j = 10, i = 0; j < 101; j=j*10, i++ )
  {
    CVAL_VERIFY(ub[i] == j);
  }

So we have two loops. In the first one the array ub is written (low index first, high index last), and then in the second loop the content of ub is checked (low index first, high index last).

After gvn this is changed into something like

         int tmp;
  for (int step1 = 0; step1 < LOOP_AMOUNT; step1++)
  {
    lb[step1] = step1 + 7;
    ub[step1] = (step1 == 0 ? 10: ub[step1-1]*10);

    switch(lb[step1]) {
    case 7: CVAL_VERIFY(step1 == 0);
                         tmp = ub[step1];
      CVAL_VERIFY(tmp == 10);
                         __label(511);
      break;
    case 8: CVAL_VERIFY(step1 == 1);
                         tmp = ub[step1];
      CVAL_VERIFY(tmp == 100);
                         __label(512);
      break;
    case 9: CVAL_VERIFY(step1 == 2);
                         tmp = ub[step1];
      CVAL_VERIFY(tmp == 1000);
                         __label(513);
      break;
    default:;
    }

                 [...]
  }

  int i, j;
  for ( j = 10, i = 0; j < 101; j=j*10, i++ )
  {
    CVAL_VERIFY((i == 0 ? tmp : ub[i]) == j);
  }

Note the introduced variable tmp.

Also note that with this rewrite, after the first loop tmp will contain the value of the element at the highest index in ub, and then we use tmp in the first round of the second loop, but there we expect the value of the lowest index element in ub.

Any ideas? Should I file a PR?

Thanks,
Mikael

tr12053_nophx.ll (5.24 KB)

Ooh, wow, that's nasty: it looks like PRE is somehow deciding that ub[step1] and ub[0] are equivalent. Please file a bug.

The reason you're having trouble splitting the "opt" invocation is that you aren't passing in "-preserve-ll-uselistorder"; it usually doesn't matter, but occasionally you'll run into an optimization which is sensitive to it.

-Eli

Do we have known cases where use-list order sensitivity is not “a bug”?

Yeah, there’s a lot of things this could be.

On the memdep side:

Note that memdep is not actually properly updated in all cases by most passes that claim to not invalidate it (they don’t invalidate dependent pointers, only pointers they directly touch).

There’s already a bug filed about this. So far we’ve only seen missed-opt, not wrong code from this.
But it should be possible to generate wrong code bugs with it (if you place the load/store in a place that invalidates the cached dependent result) , so i wonder if this is one.

Otherwise, the other reason it could give different answers after running a bunch of passes is the random scanning limits it has.

On the PRE side:
PHITranslateAddr goes through a bunch of crazy machinations to try to be able to prove things about inter-iteration variables instead of just doing phi translation.

LoopLoadElimination was written so that these could be removed, because they are pretty much crazytown.

–Dan

Hi,

Hi,

I've stumbled upon a case where I think gvn does a bad (wrong)
optimization. It's a bit messy to debug though so I'm not sure if I
should just write a PR about it a let someone who knows the code look
at it instead.

Anyway, for the bug to trigger I need to run the following passes in
the same opt invocation:
-sroa -instcombine -simplifycfg -instcombine -gvn

The problem seems to be that GVN::PerformLoadPRE triggers and I see a

GVN REMOVING PRE LOAD: %_tmp79 = load i16, i16* %_tmp78, align 2

printout.

If I instead first run

-sroa -instcombine -simplifycfg -instcombine

and then

-gvn

I don't get the

GVN REMOVING PRE LOAD

printout, and the resulting code looks ok to me.

Is this expected? Should the output differ in these two cases?

The input to gvn looks the same when running all passes and just gvn,
but I see a difference in how GVN::processNonLocalLoad(LoadInst *LI)
behaves:

I get different results from

  // Step 1: Find the non-local dependencies of the load.
  LoadDepVect Deps;
  MD->getNonLocalPointerDependency(LI, Deps);

So we get different results from MemoryDependenceResults when invoking
gvn alone or after a bunch of other passes.

Is this expected or not?

I tried to dig into
MemoryDependenceResults::getNonLocalPointerDependency but I have to
say I'm quite lost.

I ran some git bisect and as far as I can tell it starts going wrong
with commit

[SimplifyCFG] Change the algorithm in SinkThenElseCodeToEnd

2016-09-01, but that commit doesn't change GVN/MemoryDependenceResults
so I suppose it just changes the input to GVN so the problem suddenly
surfaces.

Finally, the problem that I see is:

In the input we have something like

    for (int step1 = 0; step1 < LOOP_AMOUNT; step1++)
    {
        lb[step1] = step1 + 7;
        ub[step1] = (step1 == 0 ? 10: ub[step1-1]*10);

        switch(lb[step1]) {
        case 7: CVAL_VERIFY(step1 == 0);
            CVAL_VERIFY(ub[step1] == 10);
                        __label(511);
            break;
        case 8: CVAL_VERIFY(step1 == 1);
            CVAL_VERIFY(ub[step1] == 100);
                        __label(512);
            break;
        case 9: CVAL_VERIFY(step1 == 2);
            CVAL_VERIFY(ub[step1] == 1000);
                        __label(513);
            break;
        default:;
        }

                [...]
    }

    int i, j;
    for ( j = 10, i = 0; j < 101; j=j*10, i++ )
    {
        CVAL_VERIFY(ub[i] == j);
    }

So we have two loops. In the first one the array ub is written (low
index first, high index last), and then in the second loop the content
of ub is checked (low index first, high index last).

After gvn this is changed into something like

        int tmp;
    for (int step1 = 0; step1 < LOOP_AMOUNT; step1++)
    {
        lb[step1] = step1 + 7;
        ub[step1] = (step1 == 0 ? 10: ub[step1-1]*10);

        switch(lb[step1]) {
        case 7: CVAL_VERIFY(step1 == 0);
                        tmp = ub[step1];
            CVAL_VERIFY(tmp == 10);
                        __label(511);
            break;
        case 8: CVAL_VERIFY(step1 == 1);
                        tmp = ub[step1];
            CVAL_VERIFY(tmp == 100);
                        __label(512);
            break;
        case 9: CVAL_VERIFY(step1 == 2);
                        tmp = ub[step1];
            CVAL_VERIFY(tmp == 1000);
                        __label(513);
            break;
        default:;
        }

                [...]
    }

    int i, j;
    for ( j = 10, i = 0; j < 101; j=j*10, i++ )
    {
        CVAL_VERIFY((i == 0 ? tmp : ub[i]) == j);
    }

Note the introduced variable tmp.

Also note that with this rewrite, after the first loop tmp will
contain the value of the element at the highest index in ub, and then
we use tmp in the first round of the second loop, but there we expect
the value of the lowest index element in ub.

Any ideas? Should I file a PR?

Ooh, wow, that's nasty: it looks like PRE is somehow deciding that
ub[step1] and ub[0] are equivalent.

Yes seems like it :confused:

Please file a bug.

Alright!

https://llvm.org/bugs/show_bug.cgi?id=31651

The reason you're having trouble splitting the "opt" invocation is that
you aren't passing in "-preserve-ll-uselistorder"; it usually doesn't
matter, but occasionally you'll run into an optimization which is
sensitive to it.

Aha! Yes, when I used -preserve-ll-uselistorder I could split the passes, so now I see the problem when just running GVN. Thanks!

Regards,
Mikael

Hi,

Yeah, there's a lot of things this could be.

On the memdep side:
Note that memdep is not actually properly updated in all cases by most
passes that claim to not invalidate it (they don't invalidate dependent
pointers, only pointers they directly touch).

There's already a bug filed about this.

You mean
  https://llvm.org/bugs/show_bug.cgi?id=27740
?

So far we've only seen
missed-opt, not wrong code from this.
But it should be possible to generate wrong code bugs with it (if you
place the load/store in a place that invalidates the cached dependent
result) , so i wonder if this is one.

Any idea what to do about it?

I can disable load-PRE in gvn to make this instance of the problem go away (albeit at a performance cost), but if there is a general problem with memdep I suppose it's not only load PRE where things may go wrong due to it?

Thanks,
Mikael

Hi,

Yeah, there's a lot of things this could be.

On the memdep side:
Note that memdep is not actually properly updated in all cases by most
passes that claim to not invalidate it (they don't invalidate dependent
pointers, only pointers they directly touch).

There's already a bug filed about this.

You mean
https://llvm.org/bugs/show_bug.cgi?id=27740
?

Among others.

So far we've only seen

missed-opt, not wrong code from this.
But it should be possible to generate wrong code bugs with it (if you
place the load/store in a place that invalidates the cached dependent
result) , so i wonder if this is one.

Any idea what to do about it?

It depends on where the real bug is.

You filed a reproducer that does not require running the other passes
first, which means it's either just a bug in PRE, or a bug in memdep.

If i was a betting man, i'd bet on PRE and it's inter-iteration pre
attempts.
Staring at this case for a second, my money is on the following happening.

1. PRE starts by asking if the value of step[i] where i == 0 is fully
available in all preds
2. PHITransaddr get it as part of memdep, and inside of those preds (which
are in loop 1, not loop2), using it's magical powers, gives an answer about
step[i] (note, not step[0]) instead. This answer may even be valid in the
loop. The problem may even be related to the fact that the loops both
iterate the same way (so it thinks step[i] is valid in both, even though
the i's are different).
3. The answer is not valid outside of the loop, but it now thinks the value
is fully available
4. It performs insertions and gives you the wrong answer.

If i disable all of phitransaddr's fun magic, your testcase starts working.

I'll be honest, i don't have the energy ATM to track down what fun safety
condition it's missing here.
I'd actually rather disable the code (which is a mess) and focus on
improving looploadelimination.
Maybe someone else wants to take a look at it.

(I'll add all this to the bug)

--Dan

If someone can produce an IR reproducer which they believe is incorrect, I’m happy to take a look at the PRE code. I will not have the time to reproduce this from C or work through the interaction of other passes; I will only look at this if there is a bug with PRE specific IR reproducer attached.

Philip

https://llvm.org/bugs/show_bug.cgi?id=31651 has an IR testcase.

-Eli

Hi Philip,

If someone can produce an *IR* reproducer which they believe is
incorrect, I'm happy to take a look at the PRE code. I will not have
the time to reproduce this from C or work through the interaction of
other passes; I will only look at this if there is a bug with PRE
specific IR reproducer attached.

https://llvm.org/bugs/show_bug.cgi?id=31651 has an IR testcase.

As Eli said, in the PR there is IR that I think is miscompiled if you run gvn on it.

In the PR and earlier in this thread I pasted some C-like code just to describe what I think is happening since I thought it was easier than pasting llvm assembler, but the pre_gvn.ll file in PR31651 can indeed be used to reproduce the problem by just running gvn on it.

Great if you could have a look!
Mikael

Will try to get to this by end of week. Feel free to ping me if I don't respond by then.

Philip