[PATCH] Teaching ScalarEvolution to handle IV=add(zext(trunc(IV)), Step)

Hello all,

I wanted to get some feedback on this patch for ScalarEvolution.

It addresses a performance problem I am seeing for simple benchmark.

Starting with this C code:

01: signed char foo(void)
02: {
03: const int count = 8000;
04: signed char result = 0;
05: int j;
06:
07: for (j = 0; j < count; ++j) {
08: result += (result_t)(3);
09: }
10:
11: return result;
12: }

I end up with this IR feeding into Induction Variable Simplification:

01: define signext i8 @foo() nounwind readnone {
02: entry:
03: br label %for.body
04:
05: for.body: ; preds = %entry, %for.body
06: %j.04 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
07: %result.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]
08: %conv2 = and i32 %result.03, 255
09: %add = add nsw i32 %conv2, 3
10: %inc = add nsw i32 %j.04, 1
11: %cmp = icmp slt i32 %inc, 8000
12: br i1 %cmp, label %for.body, label %for.end
13:
14: for.end: ; preds = %for.body
15: %conv1 = trunc i32 %add to i8
16: ret i8 %conv1
17: }

Unfortunately, the 'and' on line 8 prevents Scalar Evolution from
being able to create an expression for '%add' that it knows how to
evaluate.

The patch detects this pattern in createNodeForPHI and creates an
equivalent expression that can be evaluated.

Note that SCEV translates the 'and' into
   ZeroExtend(Truncate(%result.03, i8), i32)

So in terms of SCEV expressions, we essentially have
   %add[n] = Add(ZeroExtend(Truncate(%add[n-1], i8), i32), 3)

(BTW, I'm no scholar here, just a layman, so my terminology is
probably all wrong).

The patch basically moves the ZeroExtend and Truncate outside of the
recurrence, though it must take into account that for iteration n, the
ZeroExtend and Truncate apply to the value at iteration n-1.

For a constant step this is accomplished by subtracting one step from
the recurrence, performing the Truncate and ZeroExtend, and then
adding the step to the result. In other words:

   %add[n] = Add(
               ZeroExtend(
                 Truncate(
                   Minus(AddRec(Start=0,Step=3)[n], 3),
                   i8),
                 i32),
               3)

It's a little more complicated when the Step is another recurrence,
but essentially the same.

Thoughts?

Matthew C.

0001-Teach-ScalarEvolution-to-handle-IV-add-zext-trunc-IV.patch (10.9 KB)

Any ScalarEvolution experts out there have any thought about this patch?

Thanks,
Matthew Curtis.

Hello all,

I wanted to get some feedback on this patch for ScalarEvolution.

It addresses a performance problem I am seeing for simple benchmark.

Starting with this C code:

01: signed char foo(void)
02: {
03: const int count = 8000;
04: signed char result = 0;
05: int j;
06:
07: for (j = 0; j < count; ++j) {
08: result += (result_t)(3);
09: }
10:
11: return result;
12: }

FWIW, this code is probably not doing what it was intended to do; it's
result is implementation-defined or an implementation-defined signal is
raised (6.3.1.3).

I end up with this IR feeding into Induction Variable Simplification:

01: define signext i8 @foo() nounwind readnone {
02: entry:
03: br label %for.body
04:
05: for.body: ; preds = %entry,
%for.body
06: %j.04 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
07: %result.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]
08: %conv2 = and i32 %result.03, 255
09: %add = add nsw i32 %conv2, 3
10: %inc = add nsw i32 %j.04, 1
11: %cmp = icmp slt i32 %inc, 8000
12: br i1 %cmp, label %for.body, label %for.end
13:
14: for.end: ; preds = %for.body
15: %conv1 = trunc i32 %add to i8
16: ret i8 %conv1
17: }

I'm confused about how you got this IR from that C testcase. What is
result_t? Did you do anything special?

Unfortunately, the 'and' on line 8 prevents Scalar Evolution from
being able to create an expression for '%add' that it knows how to
evaluate.

The patch detects this pattern in createNodeForPHI and creates an
equivalent expression that can be evaluated.

Note that SCEV translates the 'and' into
  ZeroExtend(Truncate(%result.03, i8), i32)

So in terms of SCEV expressions, we essentially have
  %add[n] = Add(ZeroExtend(Truncate(%add[n-1], i8), i32), 3)

(BTW, I'm no scholar here, just a layman, so my terminology is
probably all wrong).

Going with your LLVM IR testcase, this looks right so far.

The patch basically moves the ZeroExtend and Truncate outside of the
recurrence, though it must take into account that for iteration n, the
ZeroExtend and Truncate apply to the value at iteration n-1.

For a constant step this is accomplished by subtracting one step from
the recurrence, performing the Truncate and ZeroExtend, and then
adding the step to the result. In other words:

  %add[n] = Add(
              ZeroExtend(
                Truncate(
                  Minus(AddRec(Start=0,Step=3)[n], 3),
                  i8),
                i32),
              3)

It's a little more complicated when the Step is another recurrence,
but essentially the same.

Something is wrong here. On the first iteration, the %add value is 3.
On the first iteration, your replacement expression's value is 256.

Dan

Dan,

Thanks for the response ...

Hello all,

I wanted to get some feedback on this patch for ScalarEvolution.

It addresses a performance problem I am seeing for simple benchmark.

Starting with this C code:

01: signed char foo(void)
02: {
03: const int count = 8000;
04: signed char result = 0;
05: int j;
06:
07: for (j = 0; j < count; ++j) {
08: result += (result_t)(3);
09: }
10:
11: return result;
12: }

FWIW, this code is probably not doing what it was intended to do; it's
result is implementation-defined or an implementation-defined signal is
raised (6.3.1.3).

Yes. The code is somewhat contrived. It's a much simplified equivalent of a benchmark. :slight_smile:

I end up with this IR feeding into Induction Variable Simplification:

01: define signext i8 @foo() nounwind readnone {
02: entry:
03: br label %for.body
04:
05: for.body: ; preds = %entry,
%for.body
06: %j.04 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
07: %result.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]
08: %conv2 = and i32 %result.03, 255
09: %add = add nsw i32 %conv2, 3
10: %inc = add nsw i32 %j.04, 1
11: %cmp = icmp slt i32 %inc, 8000
12: br i1 %cmp, label %for.body, label %for.end
13:
14: for.end: ; preds = %for.body
15: %conv1 = trunc i32 %add to i8
16: ret i8 %conv1
17: }

I'm confused about how you got this IR from that C testcase. What is
result_t? Did you do anything special?

Sorry, result_t is signed char. I was experimenting with other result types and mistakenly copied an intermediate version of the code.

The IR was produced with an internal build of clang for hexagon (with -O2). The IR produced by the current clang trunk is not quite the same.

Unfortunately, the 'and' on line 8 prevents Scalar Evolution from
being able to create an expression for '%add' that it knows how to
evaluate.

The patch detects this pattern in createNodeForPHI and creates an
equivalent expression that can be evaluated.

Note that SCEV translates the 'and' into
   ZeroExtend(Truncate(%result.03, i8), i32)

So in terms of SCEV expressions, we essentially have
   %add[n] = Add(ZeroExtend(Truncate(%add[n-1], i8), i32), 3)

(BTW, I'm no scholar here, just a layman, so my terminology is
probably all wrong).

Going with your LLVM IR testcase, this looks right so far.

The patch basically moves the ZeroExtend and Truncate outside of the
recurrence, though it must take into account that for iteration n, the
ZeroExtend and Truncate apply to the value at iteration n-1.

For a constant step this is accomplished by subtracting one step from
the recurrence, performing the Truncate and ZeroExtend, and then
adding the step to the result. In other words:

   %add[n] = Add(
               ZeroExtend(
                 Truncate(
                   Minus(AddRec(Start=0,Step=3)[n], 3),
                   i8),
                 i32),
               3)

It's a little more complicated when the Step is another recurrence,
but essentially the same.

Something is wrong here. On the first iteration, the %add value is 3.
On the first iteration, your replacement expression's value is 256.

Dan

Here's how I'm evaluating the expression (in my head):

00: Add(ZeroExtend(Truncate(Minus(AddRec(Start=0,Step=3)[n],3), i8), i32),3)
                                                          >
01: Add(ZeroExtend(Truncate(Minus(AddRec(Start=0,Step=3)[0],3), i8), i32),3)
                                   >
02: Add(ZeroExtend(Truncate(Minus(3,3), i8), i32),3)
                             >
03: Add(ZeroExtend(Truncate(0, i8), i32),3)
                    >
04: Add(ZeroExtend(0, i32),3)
         >
05: Add(0,3)
     >
06: 3

Thanks again.
Matthew Curtis

Here's how I'm evaluating the expression (in my head):

00: Add(ZeroExtend(Truncate(Minus(AddRec(Start=0,Step=3)[n],3), i8), i32),3)
                                                         >
01: Add(ZeroExtend(Truncate(Minus(AddRec(Start=0,Step=3)[0],3), i8), i32),3)
                                  >
02: Add(ZeroExtend(Truncate(Minus(3,3), i8), i32),3)

This step is wrong. The start value of the addrec is 0.

Dan

Ok, so I think I've mis-represented what's really happening.
Ignore my previous statements concerning %add :slight_smile:

Again, given:
05: for.body: ; preds = %entry, %for.body
06: %j.04 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
07: %result.03 = phi i32 [ 0, %entry ], [ %add, %for.body ]
08: %conv2 = and i32 %result.03, 255
09: %add = add nsw i32 %conv2, 3
10: %inc = add nsw i32 %j.04, 1
11: %cmp = icmp slt i32 %inc, 8000
12: br i1 %cmp, label %for.body, label %for.end

LLVM executes the following:

01: createSCEV(%conv2 = and i32 %result.03, 255)
02: calls getSCEV(%result.03)
03: returns (3 + (zext i8 {-3,+,3}<%for.body> to i32))
04: calls getTruncateExpr((3 + (zext i8 {-3,+,3}<%for.body> to i32)), i8)
05: calls getTruncateExpr(3)
06: returns 3
07: calls getTruncateExpt((zext i8 {-3,+,3}<%for.body> to i32))
08: returns {-3,+,3}<%for.body>
09: returns {0,+,3}<%for.body>
10: calls getZeroExtendExpr({0,+,3}<%for.body>, i32);
11: returns (zext i8 {0,+,3}<%for.body> to i32)
12: returns (zext i8 {0,+,3}<%for.body> to i32)

This expression is (I believe) correct for %conv2.

The intent of the patch is to construct the correct (evaluatable)
expression for %result.03 being fed into the computation of %conv2.

Does that make more sense?

Matthew C.

The problem is that this is not a correct expression for %result.03.
It isn't safe to use incorrect expressions in ScalarEvolution even if
it happens to work out in the case you're looking at.

Dan

Any suggestions on how to construct a correct expression? Or otherwise teach ScalarEvolution to handle this case?

BTW, just so I understand, the expression for %result.03 is correct for n > 0, right?

Matthew C.