`llvm.$op.with.overflow`, InstCombine and ScalarEvolution

I've run into cases where, because not all of LLVM's optimizations
understand the semantics of the `llvm.$op.with.overflow` intrinsics,
canonicalizing compares to `llvm.$op.with.overflow` ends up preventing
optimization.

For instance, running the following snippet through `opt -indvars`
optimizes `%to.optimize` to `true`, but running it through
`opt -instcombine -indvars` does not.

declare void @side_effect(i1)

define void @foo(i8 %x, i8 %y) {
 entry:
  %sum = add i8 %x, %y
  %e = icmp ugt i8 %x, %sum
  br i1 %e, label %exit, label %loop

 loop:
  %idx = phi i8 [ %x, %entry ], [ %idx.inc, %loop ]
  %idx.inc = add i8 %idx, 1
  %to.optimize = icmp ule i8 %idx, %sum
  call void @side_effect(i1 %to.optimize)
  %c = icmp ule i8 %idx.inc, %sum
  br i1 %c, label %loop, label %exit

 exit:
  ret void
}

This happens because `-instcombine` does the following transform:

entry:
  %uadd = call { i8, i1 } @llvm.uadd.with.overflow.i8(i8 %x, i8 %y)
  %0 = extractvalue { i8, i1 } %uadd, 0
  %e = extractvalue { i8, i1 } %uadd, 1
  br i1 %e, label %exit, label %loop.preheader

and ScalarEvolution can no longer see through the `extractvalue` of
the call to `llvm.uadd.with.overflow.i8`.

The right way to fix this depends on the intended purpose of the
`llvm.$op.with.overflow` intrinsics. Three solutions I can think of:

* if they're a convenience useful only for better codegen, can the
   transform that instcombine is doing currently (transforming
   compares to `extractvalue` of a `.with.overflow` intrinsic) be
   moved to CodeGenPrep?

* if they're something more fundamental, then maybe they should be
   promoted to an instruction? They've been around since at least
   llvm 2.6 as far as I can tell. Personally, I seriously doubt this
   is a good idea, given that the semantics of these intrinsics can be
   completely described by a DAG composed of existing instructions.

* add rules to ScalarEvolution to have it understand these intrinsics
   (or maybe even expand them before `-indvars` -- I think
   `-liv-reduce` tries to do this in some cases), but I'd vote for
   keeping this complexity out of ScalarEvolution unless there are
   good reasons why the `.with.overflow` calls need to be introduced
   before codegenprep.

What do you think?

-- Sanjoy

I've run into cases where, because not all of LLVM's optimizations
understand the semantics of the `llvm.$op.with.overflow` intrinsics,
canonicalizing compares to `llvm.$op.with.overflow` ends up preventing
optimization.

For instance, running the following snippet through `opt -indvars`
optimizes `%to.optimize` to `true`, but running it through
`opt -instcombine -indvars` does not.

declare void @side_effect(i1)

define void @foo(i8 %x, i8 %y) {
 entry:
  %sum = add i8 %x, %y
  %e = icmp ugt i8 %x, %sum
  br i1 %e, label %exit, label %loop

 loop:
  %idx = phi i8 [ %x, %entry ], [ %idx.inc, %loop ]
  %idx.inc = add i8 %idx, 1
  %to.optimize = icmp ule i8 %idx, %sum
  call void @side_effect(i1 %to.optimize)
  %c = icmp ule i8 %idx.inc, %sum
  br i1 %c, label %loop, label %exit

 exit:
  ret void
}

This happens because `-instcombine` does the following transform:

entry:
  %uadd = call { i8, i1 } @llvm.uadd.with.overflow.i8(i8 %x, i8 %y)
  %0 = extractvalue { i8, i1 } %uadd, 0
  %e = extractvalue { i8, i1 } %uadd, 1
  br i1 %e, label %exit, label %loop.preheader

and ScalarEvolution can no longer see through the `extractvalue` of
the call to `llvm.uadd.with.overflow.i8`.

The right way to fix this depends on the intended purpose of the
`llvm.$op.with.overflow` intrinsics. Three solutions I can think of:

* if they're a convenience useful only for better codegen, can the
   transform that instcombine is doing currently (transforming
   compares to `extractvalue` of a `.with.overflow` intrinsic) be
   moved to CodeGenPrep?

Unfortunately, they are not a mere convenience. I've personally spent
effort trying to optimize away overflow checks in InstCombine; it does
this in its InstCombineCalls visitor.

* if they're something more fundamental, then maybe they should be
   promoted to an instruction? They've been around since at least
   llvm 2.6 as far as I can tell. Personally, I seriously doubt this
   is a good idea, given that the semantics of these intrinsics can be
   completely described by a DAG composed of existing instructions.

I think promoting them would be quite disruptive. Most passes don't care
about these operations and can't do much with them.

* add rules to ScalarEvolution to have it understand these intrinsics
   (or maybe even expand them before `-indvars` -- I think
   `-liv-reduce` tries to do this in some cases), but I'd vote for
   keeping this complexity out of ScalarEvolution unless there are
   good reasons why the `.with.overflow` calls need to be introduced
   before codegenprep.

If we don't care about trying to optimize out overflow checks in
InstCombine, I'd go with moving the complexity to CGP. However, I think
InstCombine is doing the right thing here by forming these. I think there
are two choices that keep our optimizations:
- Teach SCEV, one way or another, about *_with_overflow.
- Don't form overflow intrinsics in InstCombine but instead teach
ProcessUGT_ADDCST_ADD, et al. to use WillNotOverflowUnsignedAdd, et al. to
optimize away the compare.

The second approach sounds much easier but I don't know what we lose by
doing it.

I've run into cases where, because not all of LLVM's optimizations
understand the semantics of the `llvm.$op.with.overflow` intrinsics,
canonicalizing compares to `llvm.$op.with.overflow` ends up preventing
optimization.

For instance, running the following snippet through `opt -indvars`
optimizes `%to.optimize` to `true`, but running it through
`opt -instcombine -indvars` does not.

declare void @side_effect(i1)

define void @foo(i8 %x, i8 %y) {
entry:
 %sum = add i8 %x, %y
 %e = icmp ugt i8 %x, %sum
 br i1 %e, label %exit, label %loop

loop:
 %idx = phi i8 [ %x, %entry ], [ %idx.inc, %loop ]
 %idx.inc = add i8 %idx, 1
 %to.optimize = icmp ule i8 %idx, %sum
 call void @side_effect(i1 %to.optimize)
 %c = icmp ule i8 %idx.inc, %sum
 br i1 %c, label %loop, label %exit

exit:
 ret void
}

This happens because `-instcombine` does the following transform:

entry:
 %uadd = call { i8, i1 } @llvm.uadd.with.overflow.i8(i8 %x, i8 %y)
 %0 = extractvalue { i8, i1 } %uadd, 0
 %e = extractvalue { i8, i1 } %uadd, 1
 br i1 %e, label %exit, label %loop.preheader

That’s new to me. I guess I missed that.

and ScalarEvolution can no longer see through the `extractvalue` of
the call to `llvm.uadd.with.overflow.i8`.

The right way to fix this depends on the intended purpose of the
`llvm.$op.with.overflow` intrinsics. Three solutions I can think of:

* if they're a convenience useful only for better codegen, can the
  transform that instcombine is doing currently (transforming
  compares to `extractvalue` of a `.with.overflow` intrinsic) be
  moved to CodeGenPrep?

The intrinsics tightly couple the operation with the branch, preventing the optimizer from obscuring the relationship, thereby effectively guaranteeing good codegen. If the intrinsics are needed to expose mid-level optimizations, I’m not aware of it. I think mid-level optimization should generally work on the canonical compare-and-branch form.

I don't understand why we would want to generate intrinsics early unless the frontend does it. It seems that if InstCombine can pattern match these cases, then it can just as easily optimize the comparison directly. Isn’t it already doing that?

* if they're something more fundamental, then maybe they should be
  promoted to an instruction? They've been around since at least
  llvm 2.6 as far as I can tell. Personally, I seriously doubt this
  is a good idea, given that the semantics of these intrinsics can be
  completely described by a DAG composed of existing instructions.

Right. They are a way for the frontend to communicate directly with codegen, *preventing* optimization from getting in the way. That’s a pretty standard use for intrinsics.

* add rules to ScalarEvolution to have it understand these intrinsics
  (or maybe even expand them before `-indvars` -- I think
  `-liv-reduce` tries to do this in some cases), but I'd vote for
  keeping this complexity out of ScalarEvolution unless there are
  good reasons why the `.with.overflow` calls need to be introduced
  before codegenprep.

What do you think?

I don’t think ScalarEvolution can possibly represent overflow conditions. -liv-reduce clones the data flow, decoupling the overflow check from the arithmetic, just hoping that ISel can put it back together! I don’t think anyone actively uses it at this point, so loop optimization is essentially disabled if you use these intrinsics.

If cloning+expanding the intrinsic (-liv-reduce) is really the best way to expose these to SCEV, then clearly InstCombine should not be fusing them to begin with.

An alternative that avoids cloning+expanding the intrinsic would be to allow SCEV to analyze the expression for the intrinsic’s value result, but avoid replacing the intrinsic with a materialized SCEV expression (we would have no way to replace the overflow result anyway). To do that, I think the current uses of SCEVExpander need to be surveyed to ensure that they will handle lazily cloning the intrinsic when materializing the expression for the intrinsic’s value result. This would be wrong for LSR, where it assumes SCEV knows about all uses of the value, but that could probably be worked around or we could just bail on LSR.

So:
- I don’t understand why InstCombine is materializing the intrinsic
- But it is still a good idea to teach SCEV clients how to handle code with the intrinsic

Andy

The intrinsics tightly couple the operation with the branch, preventing the optimizer from obscuring the relationship, thereby effectively guaranteeing good codegen. If the intrinsics are needed to expose mid-level optimizations, I’m not aware of it. I think mid-level optimization should generally work on the canonical compare-and-branch form.

The places within opt (other than InstCombine) that special case the
_with_overflow intrinsics are ConstantFolding, ValueTracking and GVN.
Their extent of "special casing" is just pretending "extractvalue 0
(smul_with_overflow A B)" is "A * B" and the like, so I don't think
we'll lose optimization potential by not transforming IR to use
the _with_overflow intrinsics in -instcombine.

The _with_overflow intrinsics are also matched and handled in some
places within SelectionDAG, ISel and some TTIs. I'm more worried
about these, as it is possible that not materializing _with_overflow
may affect codegen quality. But we can always fix this by moving the
"introduce _with_overflow" phase into CodegenPrep.

An alternative that avoids cloning+expanding the intrinsic would be to allow SCEV to analyze the expression for the intrinsic’s value result, but avoid replacing the intrinsic with a materialized SCEV expression (we would have no way to replace the overflow result anyway).

If I understand you correctly, I suspect this will be hard to do
right because SCEV expressions will have to track state that they
don't track currently (which Value* did this AddRecExpr come from?).
We'll also end up with weird uniquing issues -- is "extractvalue 0
(uadd.with.overflow(A, B))" the same SCEVAddExpr* as "A + B"?

-- Sanjoy

If we don't care about trying to optimize out overflow checks in
InstCombine, I'd go with moving the complexity to CGP.

I think instcombine should optimize out overflow checks (as it does
today) without introducing _with_overflow calls. Are there reasons
why such an approach would not work?

However, I think
InstCombine is doing the right thing here by forming these.

I don't quite agree with you on this -- by materializing these
intrinsics InstCombine is making every pass that runs after it less
effective. All of them have to know about the _with_overflow
intrinsics to optimize IR that they could otherwise have optimized.

Another example is GVN -- `opt -gvn` optimizes away %to.optimize but
`opt -instcombine -gvn` does not.

declare void @side_effect(i1)

define void @foo(i8 %x, i8 %y) {
entry:
  %sum = add i8 %x, %y
  %e = icmp ugt i8 %x, %sum
  br i1 %e, label %yes, label %no

yes:
  %to.optimize = icmp ugt i8 %x, %sum
  call void @side_effect(i1 %to.optimize)
  br label %no

no:
  ret void
}

The intrinsics tightly couple the operation with the branch, preventing the optimizer from obscuring the relationship, thereby effectively guaranteeing good codegen. If the intrinsics are needed to expose mid-level optimizations, I’m not aware of it. I think mid-level optimization should generally work on the canonical compare-and-branch form.

The places within opt (other than InstCombine) that special case the
_with_overflow intrinsics are ConstantFolding, ValueTracking and GVN.
Their extent of "special casing" is just pretending "extractvalue 0
(smul_with_overflow A B)" is "A * B" and the like, so I don't think
we'll lose optimization potential by not transforming IR to use
the _with_overflow intrinsics in -instcombine.

Sure. The point is, transforming to _with_overflow never exposes *more* optimization in these passes.

The _with_overflow intrinsics are also matched and handled in some
places within SelectionDAG, ISel and some TTIs. I'm more worried
about these, as it is possible that not materializing _with_overflow
may affect codegen quality. But we can always fix this by moving the
"introduce _with_overflow" phase into CodegenPrep.

Yes. Completely agree.

An alternative that avoids cloning+expanding the intrinsic would be to allow SCEV to analyze the expression for the intrinsic’s value result, but avoid replacing the intrinsic with a materialized SCEV expression (we would have no way to replace the overflow result anyway).

If I understand you correctly, I suspect this will be hard to do
right because SCEV expressions will have to track state that they
don't track currently (which Value* did this AddRecExpr come from?).
We'll also end up with weird uniquing issues -- is "extractvalue 0
(uadd.with.overflow(A, B))" the same SCEVAddExpr* as "A + B"?

This is what I do not want to do. I think that approach is totally infeasible. I was suggesting that SCEV ignores overflow (just like ValueTracking/GVN), and the SCEV client never rewrites the intrinsic via SCEVExpander. The _with_overflow value can be substituted but not the overflow bit. I haven’t thought that approach through.

Andy

If we don't care about trying to optimize out overflow checks in
InstCombine, I'd go with moving the complexity to CGP.

I think instcombine should optimize out overflow checks (as it does
today) without introducing _with_overflow calls. Are there reasons
why such an approach would not work?

I agree. I think that approach would work in any case simple enough for InstCombine to otherwise convert the same compare/branch into a with_overflow intrinsic.

I think the problem we run into is that we also want to optimize the _with_overflow cases that are generated by some frontends. We are stuck with some duplication of effort.

Either way, I think the optimizer should handle compare-and-branch idioms first and foremost.

Andy

If we don't care about trying to optimize out overflow checks in
InstCombine, I'd go with moving the complexity to CGP.

I think instcombine should optimize out overflow checks (as it does
today) without introducing _with_overflow calls. Are there reasons
why such an approach would not work?

I agree. I think that approach would work in any case simple enough for InstCombine to otherwise convert the same compare/branch into a with_overflow intrinsic.

I think the problem we run into is that we also want to optimize the _with_overflow cases that are generated by some frontends. We are stuck with some duplication of effort.

Possibly crazy idea, but could we lower incoming x.with.overflow to their IR equivalents and then rematch later? This would seem to get us the best of both worlds provided that there aren't key cases where we can optimize the intrinsics better than the IR patterns.

If we don’t care about trying to optimize out overflow checks in
InstCombine, I’d go with moving the complexity to CGP.

I think instcombine should optimize out overflow checks (as it does
today) without introducing _with_overflow calls. Are there reasons
why such an approach would not work?

I agree. I think that approach would work in any case simple enough for InstCombine to otherwise convert the same compare/branch into a with_overflow intrinsic.

I think the problem we run into is that we also want to optimize the _with_overflow cases that are generated by some frontends. We are stuck with some duplication of effort.

Possibly crazy idea, but could we lower incoming x.with.overflow to their IR equivalents and then rematch later? This would seem to get us the best of both worlds provided that there aren’t key cases where we can optimize the intrinsics better than the IR patterns.

Not a crazy idea at all, but it mostly defeats the purpose of the intrinsic and takes power away from the frontend. Frontends use the intrinsics when it’s important to effectively guarantee the best codegen, knowing that the optimizer could otherwise obscure the overflow check. Getting that codegen right is often more important than whatever other optimizations may be inhibited by the intrinsic. If the frontend wants to give the optimizer free rein, then it simply shouldn’t use the intrinsics.

Andy