Question about NoWrap flag for SCEVAddRecExpr

chentong319 · June 10, 2015, 1:21pm

I am testing vectorization on the following test case:

float x[1024], y[1024];

void myloop1() {
for (long int k = 0; k < 512; k++) {
x[2k] = x[2k]+y[k];
}
}

Vectorization failed due to “unsafe dependent memory operation”. I traced the LoopAccessAnalysis.cpp and found the reason is the NoWrapFlag for SCEVAddRecExpr is not set and consequently the dependence distant became unknown.

Can anyone familiar with ScalarRevolution tell me whether this is an expected behavior or a bug?

Tong

Sanjoy_Das1 · June 10, 2015, 8:29pm

[+CC Andy]

Can anyone familiar with ScalarRevolution tell me whether this is an
expected behavior or a bug?

Assuming you're talking about 2*k, this is a bug. ScalarEvolution
should be able to prove that {0,+,4} is <nsw> and <nuw>.

Part of the problem is that in this case ScalarEvolution does not try
to prove that {0,+,4} is <nsw> when the expression is constructed
(since proving that has non-trivial cost) [1]. To get ScalarEvolution
to try to prove that {0,+,4} has no-wrap properties, the client needs
to construct a sign-extend expression of {0,+,4}. SCEV will try to
change a sext of an add-rec to an add-rec of sexts and try to prove
no-wrap in the process [2].

To easily do this from IR, you can just add a dummy sext instruction
(like in the IR fragment below) and run
'opt -analyze -scalar-evolution -scalar-evolution' (just running SCEV
won't dce the unused instruction):

  ; ModuleID = '<stdin>'
  target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
  target triple = "x86_64-apple-macosx10.10.0"

@x = common global [1024 x float] zeroinitializer, align 16
@y = common global [1024 x float] zeroinitializer, align 16

  ; Function Attrs: nounwind ssp uwtable
  define void @myloop1() {
  bb:
    br label %bb2

bb1: ; preds = %bb2
ret void

  bb2: ; preds = %bb2, %bb
    %k.01 = phi i64 [ 0, %bb ], [ %tmp15, %bb2 ]
    %tmp = shl nsw i64 %k.01, 1
    %dummy_sext = sext i64 %tmp to i128
    %tmp3 = getelementptr inbounds [1024 x float], [1024 x float]* @x,
i64 0, i64 %tmp
    %tmp4 = load float, float* %tmp3, align 16, !tbaa !2
    %tmp5 = getelementptr inbounds [1024 x float], [1024 x float]* @y,
i64 0, i64 %k.01
    %tmp6 = load float, float* %tmp5, align 8, !tbaa !2
    %tmp7 = fadd float %tmp4, %tmp6
    store float %tmp7, float* %tmp3, align 16, !tbaa !2
    %tmp8 = or i64 %k.01, 1
    %tmp9 = shl nsw i64 %tmp8, 1
    %tmp10 = getelementptr inbounds [1024 x float], [1024 x float]*
@x, i64 0, i64 %tmp9
    %tmp11 = load float, float* %tmp10, align 8, !tbaa !2
    %tmp12 = getelementptr inbounds [1024 x float], [1024 x float]*
@y, i64 0, i64 %tmp8
    %tmp13 = load float, float* %tmp12, align 4, !tbaa !2
    %tmp14 = fadd float %tmp11, %tmp13
    store float %tmp14, float* %tmp10, align 8, !tbaa !2
    %tmp15 = add nsw i64 %k.01, 2
    %exitcond.1 = icmp eq i64 %tmp15, 512
    br i1 %exitcond.1, label %bb1, label %bb2
  }

  !0 = !{i32 1, !"PIC Level", i32 2}
  !1 = !{!"clang version 3.7.0 (clang-stage2-configure-Rlto_build 239114)"}
  !2 = !{!3, !3, i64 0}
  !3 = !{!"float", !4, i64 0}
  !4 = !{!"omnipotent char", !5, i64 0}
  !5 = !{!"Simple C/C++ TBAA"}

However, adding a dummy sext only proves <nuw> for {0,+,4} and not
<nsw>. The problem is that when constructing sext({0,+,4}) SCEV
realizes that since {0,+,4} is always positive, sext({0,+,4}) ==
zext({0,+,4}); and to change a zext of an add-rec to an add-rec of
zexts, SCEV only needs to prove <nuw> and not <nsw>.

I wonder if it makes sense to add a hook to SCEV that gets it to try
as hard as it can to prove <nuw> / <nsw> for specific add recurrences.

[1]: It will try to prove nuw and nsw in cases where it is "easy", but
not in this specific case.

[2]: So a worthwhile project is to have the vectorizer construct sign
     extend expressions of add recurrences that it really cares about
     proving no-wrap of and then check the flags on the
     SCEVAddRecExpr. It may consume too much compile time, so there's
     a tricky trade-off here.

-- Sanjoy

ghoflehner · June 10, 2015, 11:29pm

Hi Sanjoy,

Interesting. I took a closer look at SCEV when I saw Tong’s example. I’m curious about "[1]: It will try to prove nuw and nsw in cases where it is "easy", but not in this specific case.” Could you describe “easy” and “difficult” algorithmically?

Thanks
Gerolf

Sanjoy_Das1 · June 11, 2015, 12:14am

Hi Gerolf,

I don't have a precise algorithmic notion of "easy" and "hard" for
this. But, in a hand-wavy sense,

"easy" is a quick check on the flags on the increment expression. See
definition of 'Flags' in ScalarEvolution::createNodeForPHI.

"hard" is doing things like computing the backedge taken count and
using that to prove no-wrap (amongst other things). See the
SCEVAddRecExpr case in ScalarEvolution::getSignExtendExpr.

-- Sanjoy

anemet · June 11, 2015, 12:29am

[+Arnold]

[+CC Andy]

Can anyone familiar with ScalarRevolution tell me whether this is an
expected behavior or a bug?

Assuming you're talking about 2*k, this is a bug. ScalarEvolution
should be able to prove that {0,+,4} is <nsw> and <nuw>.

I also find it surprising that the inbounds gep does not allow us to prove nuw of the pointer here. LAA has logic for this.

Not to mention that we’re trying to figure out the distance of x[2*k] against *itself* which should be zero regardless of wrapping.

A probably more interesting case is x[2*k] = x[2*k+2] + … which would require non-wrapping.

Looks like we only consider an inbounds gep non-wrapping if it uses a unit stride. I am not sure why…

Adam

Sanjoy_Das1 · June 11, 2015, 1:17am

I'm not sure if inbounds can be used to prove <nuw>. If an object
%OBJ is allocated at address -1 then "gep inbounds %OBJ 1" is not
poison, but the underlying computation unsigned-overflows.

It may be possible to use inbounds to derive <nsw> but that too is
tricky since inbounds does not directly imply anything about no-wrap
as far as I can tell.

Arnold · June 11, 2015, 1:19am

Inbounds does not make any statements about the behavior of a sub expression involved in an index to the gep

addr =
index = 2*k // this can very well wrap
gep addr, ..., index // addr +index is inbounds

There is a slight of hands that we do in the vectorizer with unit strides where we rely on contiguous adress space.

anemet · June 11, 2015, 5:59am

I’m not sure if inbounds can be used to prove . If an object
%OBJ is allocated at address -1 then “gep inbounds %OBJ 1” is not
poison, but the underlying computation unsigned-overflows.

I think that this should yield poison per langref because the signed arithmetic implied by the indices should be performed with infinite precision. Base is treated as unsigned so 0xff…ff + 1 would be 0x100…00 which would be out of bounds (sort of). See also this which I think is even more explicit: http://llvm.org/docs/GetElementPtr.html#what-happens-if-a-gep-computation-overflows

Adam

Sanjoy_Das1 · June 11, 2015, 6:44am

Base is treated as unsigned so 0xff…ff + 1 would be 0x100…00

This is the part I was missing, thanks for pointing out the FAQ. So
the infinitely precise address computed by a GEP is

zext(Base) + sext(Idx0) + sext(Idx1) ... ?

0x100…00 which would be out of bounds (sort of).

Does this mean, for C++ programs of the form,

for (int *I = array, *E = array + size; I != E; ++I)
...

the memory allocator has to guarantee that array cannot span
[0xff..fffff-31,0xff..fffff] (both inclusive) with size == 32?

-- Sanjoy

anemet · June 11, 2015, 7:02am

Base is treated as unsigned so 0xff…ff + 1 would be 0x100…00

This is the part I was missing, thanks for pointing out the FAQ. So
the infinitely precise address computed by a GEP is

zext(Base) + sext(Idx0) + sext(Idx1) … ?

Yes, that is the way I read it.

0x100…00 which would be out of bounds (sort of).

Does this mean, for C++ programs of the form,

for (int *I = array, *E = array + size; I != E; ++I)
...

the memory allocator has to guarantee that array cannot span
[0xff..fffff-31,0xff..fffff] (both inclusive) with size == 32?

I think so. Address 0 cannot be dereferenced, so you can’t have a valid object spanning across address 0.

Adam

anemet · June 11, 2015, 7:24am

Inbounds does not make any statements about the behavior of a sub expression involved in an index to the gep

addr =
index = 2*k // this can very well wrap
gep addr, ..., index // addr +index is inbounds

… and since 2*k is part of the SCEV ({x,+,2}), the SCEV may wrap. Got it, thanks.

That said, we should still be able to analyze these, I think:

1. {x,+,2} vs {x,+,2}

There is obviously a dep-distance 0 between them regardless of wrapping.

2. {x+8,+,2} vs {x,+,2}

Because of the known small-constant difference, this is either a dep-distance of 8 or 8-Size_of_addr_space. Thus as long as this is normally a forward dep of a small known distance (and with wrapping huge-distance backward dep) we should be able to vectorize this.

What do you think?

Adam

Sanjoy_Das1 · June 11, 2015, 7:48am

Base is treated as unsigned so 0xff…ff + 1 would be 0x100…00

This is the part I was missing, thanks for pointing out the FAQ. So
the infinitely precise address computed by a GEP is

zext(Base) + sext(Idx0) + sext(Idx1) … ?

Yes, that is the way I read it.

0x100…00 which would be out of bounds (sort of).

Does this mean, for C++ programs of the form,

for (int *I = array, *E = array + size; I != E; ++I)
...

the memory allocator has to guarantee that array cannot span
[0xff..fffff-31,0xff..fffff] (both inclusive) with size == 32?

I think so. Address 0 cannot be dereferenced, so you can’t have a valid object spanning across address 0.

I the example I meant to give, [0xff..fffff-31,0xff..fffff] == [-32,
-1] does not span address 0 -- address 0 is the address one byte
outside the range assigned to `array`.

Arnold · June 11, 2015, 1:11pm

Inbounds does not make any statements about the behavior of a sub expression involved in an index to the gep

addr =
index = 2*k // this can very well wrap
gep addr, ..., index // addr +index is inbounds

… and since 2*k is part of the SCEV ({x,+,2}), the SCEV may wrap. Got it, thanks.

That said, we should still be able to analyze these, I think:

1. {x,+,2} vs {x,+,2}

There is obviously a dep-distance 0 between them regardless of wrapping.

Yes obviously.

2. {x+8,+,2} vs {x,+,2}

Because of the known small-constant difference, this is either a dep-distance of 8 or 8-Size_of_addr_space. Thus as long as this is normally a forward dep of a small known distance (and with wrapping huge-distance backward dep) we should be able to vectorize this.

Yes.

anemet · June 11, 2015, 8:40pm

Digging more reveals that the formulation of inbounds matches the C standard — not too surprisingly.

C99 6.5.8/5 Relational operators

If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.

So this works as expected without a potential overflow:

for (char *p = array; p < array + sizeof(array); ++p) …

Adam

chentong319 · June 12, 2015, 1:47pm

Thanks for all the discussions. Can I draw following conclusions from them:

We cannot set NoWrap flag for x+2k only based on the loop induction
If we know x+2k is an inbound array GEP, the NoWrap flag can be set true according to the C standard

Therefore, should I make the following change in LoopAccessAnalysis.cpp:543:

original code:

bool IsInBoundsGEP = isInBoundsGep(Ptr);
bool IsNoWrapAddRec = AR->getNoWrapFlags(SCEV::NoWrapMask);

to

bool IsInBoundsGEP = isInBoundsGep(Ptr);
bool IsNoWrapAddRec = AR->getNoWrapFlags(SCEV::NoWrapMask) || IsInBoundsGEP;

Tong

anemet · June 12, 2015, 9:57pm

Thanks for all the discussions. Can I draw following conclusions from them:

We cannot set NoWrap flag for x+2k only based on the loop induction

If we know x+2k is an inbound array GEP, the NoWrap flag can be set true according to the C standard

No, but we can further analyze the SCEVs to conclude something more meaningful than Dependence::Unknown in certain special cases. I am actually looking at this based on the examples I included in my reply to Arnold. Should have something soon.

Adam

anemet · June 16, 2015, 5:49am

Inbounds does not make any statements about the behavior of a sub expression involved in an index to the gep

addr =
index = 2*k // this can very well wrap
gep addr, ..., index // addr +index is inbounds

… and since 2*k is part of the SCEV ({x,+,2}), the SCEV may wrap. Got it, thanks.

That said, we should still be able to analyze these, I think:

1. {x,+,2} vs {x,+,2}

There is obviously a dep-distance 0 between them regardless of wrapping.

I’ve actually changed direction on this. The problem is that wrapping can introduce loop-carried dependences that are hard to figure out. E.g., consider the IR equivalent of this:

for (unsigned char i = 0; i < 16; i++)
A[128 * i] = A[128 * i] + 2;

(it does not really work in C due to the default promotion of the char arithmetic)

The value written in iteration I will be read in I + 2, so we have a backward dependence with distance 2.

So rather than this approach, I went back trying to prove that the pointer can’t wrap. (The mul is actually marked non-wrapping in the IR just SCEV can’t use it.) See ⚙ D10472 [LAA] Try to prove non-wrapping of pointers if SCEV cannot.

Feedback is welcome. Thanks.

Adam

Topic		Replies	Views
ScalarEvolution invariants around wrapping flags LLVM Dev List Archives	10	96	October 2, 2019
ScalarEvolution "add nsw" question LLVM Dev List Archives	6	87	April 13, 2016
[ScalarEvolution][SCEV] no-wrap flags dependent on order of getSCEV() calls LLVM Dev List Archives	10	91	August 15, 2017
ScalarEvolution Patch LLVM Dev List Archives	2	67	February 22, 2008
ScalarEvolution, HowManyLessThans question for step > 1 LLVM Dev List Archives	1	73	July 22, 2013

Question about NoWrap flag for SCEVAddRecExpr

Related Topics