Potentially unsafe loop optimization

I'm seeing a case where an add instruction with "nuw" is moved to what
I think is an unsafe place and I'm wondering if there really is a bug
here. This comes from the small Ada code:

PROCEDURE C26006A IS

        procedure C_abort with Import, Convention => C,
          External_Name => "abort";

        S1 : STRING (1..3) := "A 1";
        S2 : STRING (1..3) := "A 2";

BEGIN
        FOR C IN CHARACTER'FIRST .. CHARACTER'LAST LOOP
                S1 (2) := C;
                S2 (2) := C;
                if S1 = S2 then
                   C_Abort;
                end if;
        END LOOP;
END C26006A;

and the issue is the loop, which is from (interpreting the i8 as unsigned)
0 to 255 (-1).

The loop end test looks like:

loop.cond.iter: ; preds = %8
  %6 = load i8, i8* %c, align 1, !tbaa !5
  %loop.iter.cond = icmp eq i8 %6, -1
  br i1 %loop.iter.cond, label %loop.exit, label %loop.iter

loop.iter: ; preds = %loop.cond.iter
  %next.loop.var = add nuw i8 %6, 1
  store i8 %next.loop.var, i8* %c, align 1, !tbaa !5
  br label %loop.cond

This looks correct. We do the add after the conditional branch, so
we'll never see it overflow.

But when we optimize with -O2 on this, we get:

loop.cond.iter: ; preds = %loop.cond, %3
  %loop.iter.cond = icmp eq i8 %c.0, -1
  %next.loop.var = add nuw i8 %c.0, 1
  br i1 %loop.iter.cond, label %loop.exit, label %loop.cond

But now the add is done unconditionally, whatever the result of the
comparison. That means that in the case where the icmp is true, we
execute an undefined instruction. When we generate code from this, the
code generator notices that, eliminates the test, and makes this an
an conditional jump back, with the relevant code being:

.LBB0_3: # %loop.cond.iter
                                        # in Loop: Header=BB0_1 Depth=1
        incb %bl
.LBB0_1: # %loop.cond
                                        # =>This Inner Loop Header: Depth=1
        movb %bl, 17(%rsp)
        movb %bl, 1(%rsp)
        movzwl (%rsp), %eax
        xorw 16(%rsp), %ax
        movzbl 2(%rsp), %ecx
        xorb 18(%rsp), %cl
        movzbl %cl, %ecx
        orw %ax, %cx
        jne .LBB0_3
# %bb.2: # in Loop: Header=BB0_1 Depth=1
        callq abort
        jmp .LBB0_3

I think the initial IR is correct, but the loop optimization here is
unsafe: if it wants to promote the add instruction, it needs to remove
any "nsw" or "nuw". Does that sound correct?

The full .ll is below:

; ModuleID = 'c26006a.adb'
source_filename = "c26006a.adb"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16
:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @_ada_c26006a() {
entry:
  %s1 = alloca { { i32, i32 }, [3 x i8] }, align 4
  %s2 = alloca { { i32, i32 }, [3 x i8] }, align 4
  %c = alloca i8, align 1
  %0 = bitcast { { i32, i32 }, [3 x i8] }* %s1 to [3 x i8]*
  store [3 x i8] c"A 1", [3 x i8]* %0, align 4, !tbaa !0
  %1 = bitcast { { i32, i32 }, [3 x i8] }* %s2 to [3 x i8]*
  store [3 x i8] c"A 2", [3 x i8]* %1, align 4, !tbaa !0
  store i8 0, i8* %c, align 1, !tbaa !5
  br i1 true, label %loop.cond, label %loop.exit

loop.cond: ; preds = %loop.iter, %entry
  %2 = load i8, i8* %c, align 1, !tbaa !5
  %3 = getelementptr inbounds [3 x i8], [3 x i8]* %0, i64 0, i32 1
  store i8 %2, i8* %3, align 1, !tbaa !8
  %4 = load i8, i8* %c, align 1, !tbaa !5
  %5 = getelementptr inbounds [3 x i8], [3 x i8]* %1, i64 0, i32 1
  store i8 %4, i8* %5, align 1, !tbaa !8
  br i1 false, label %rhs.has.0.dim, label %normal.tests

loop.iter: ; preds = %loop.cond.iter
  %next.loop.var = add nuw i8 %6, 1
  store i8 %next.loop.var, i8* %c, align 1, !tbaa !5
  br label %loop.cond

loop.exit: ; preds = %loop.cond.iter, %entry
  ret void

loop.cond.iter: ; preds = %8
  %6 = load i8, i8* %c, align 1, !tbaa !5
  %loop.iter.cond = icmp eq i8 %6, -1
  br i1 %loop.iter.cond, label %loop.exit, label %loop.iter

7: ; preds = %9, %rhs.has.0.dim
  call void @abort()
  br label %8

8: ; preds = %7, %9, %normal.tests
  br label %loop.cond.iter

rhs.has.0.dim: ; preds = %loop.cond
  br i1 false, label %7, label %normal.tests

normal.tests: ; preds = %rhs.has.0.dim, %loop.cond
  br i1 true, label %9, label %8

9: ; preds = %normal.tests
  %10 = bitcast [3 x i8]* %1 to i8*
  %11 = bitcast [3 x i8]* %0 to i8*
  %12 = call i32 @memcmp(i8* align 4 %10, i8* align 4 %11, i64 3)
  %13 = icmp eq i32 %12, 0
  br i1 %13, label %7, label %8
}

; Function Attrs: nounwind
declare dso_local i32 @memcmp(i8* nocapture nonnull readonly, i8* nocapture nonnull readonly, i64) #0

declare void @abort()

attributes #0 = { nounwind }

!0 = !{!1, !1, i64 0, i64 3}
!1 = !{!2, i64 3, !"c26006a__T2b#TN#AD", !3, i64 0, i64 1, !3, i64 1, i64 1, !3, i64 2, i64 1}
!2 = !{!"Ada Root"}
!3 = !{!4, i64 1, !"character#T0"}
!4 = !{!2, i64 1, !"character#TN"}
!5 = !{!6, !6, i64 0, i64 1}
!6 = !{!7, i64 1, !"T4b#T4"}
!7 = !{!4, i64 1, !"T4b#TN"}
!8 = !{!3, !3, i64 0, i64 1}

Let me follow up on my own message: it seems that even if there's no "nuw"
in the original IR, the optimizer generates one, which seems very odd.

Executing the add instruction that results in an unsigned wrap despite
the nuw flag is not an undefined instruction, but results in a
'poison' value. Only certain uses of a poison value results in
undefined behaviour. See
https://llvm.org/docs/LangRef.html#poison-values for more information.

<llvm-dev@lists.llvm.org>:> The loop end test looks like:

loop.cond.iter: ; preds = %8
  %6 = load i8, i8* %c, align 1, !tbaa !5
  %loop.iter.cond = icmp eq i8 %6, -1
  br i1 %loop.iter.cond, label %loop.exit, label %loop.iter

loop.iter: ; preds = %loop.cond.iter
  %next.loop.var = add nuw i8 %6, 1
  store i8 %next.loop.var, i8* %c, align 1, !tbaa !5
  br label %loop.cond

This looks correct. We do the add after the conditional branch, so
we'll never see it overflow.

I think this loop already has undefined behaviour. If %6 eventually
loads the value 255, %next.loop.var will be poison. Storing poison to
memory is not itself undefined, but write an undefined value.

Michael

Executing the addition is not a problem by itself.

Adding two values that causes an overflow when there is a nuw
results in a poison value. We never use that value because it
is only poison if loop.iter.cond is ture and we exit the loop.

Long story short, from what I can see there is no miscompilation
or change in semantics for that matter.

~ Johannes

Executing the add instruction that results in an unsigned wrap despite
the nuw flag is not an undefined instruction, but results in a
'poison' value. Only certain uses of a poison value results in
undefined behaviour. See
https://llvm.org/docs/LangRef.html#poison-values for more information.

I stand corrected. But then why does the code generator view the
branch as always true?

> loop.cond.iter: ; preds = %8
> %6 = load i8, i8* %c, align 1, !tbaa !5
> %loop.iter.cond = icmp eq i8 %6, -1
> br i1 %loop.iter.cond, label %loop.exit, label %loop.iter
>
> loop.iter: ; preds = %loop.cond.iter
> %next.loop.var = add nuw i8 %6, 1
> store i8 %next.loop.var, i8* %c, align 1, !tbaa !5
> br label %loop.cond
>
> This looks correct. We do the add after the conditional branch, so
> we'll never see it overflow.

I think this loop already has undefined behaviour. If %6 eventually
loads the value 255, %next.loop.var will be poison.

I don't think so because if %6 is loaded with the value 255 (-1), the
branch won't go to loop.iter where the add is done. So we never get
to the add that generates poison.

Also, note that I tried this without the "nuw" on the original add and
the optimizer put that on the add instruction that it made.

Long story short, from what I can see there is no miscompilation
or change in semantics for that matter.

So why does the .s file not contain the loop exit test?

        .text
        .file "c26006a.adb"
        .globl _ada_c26006a # -- Begin function _ada_c26006a
        .p2align 4, 0x90
        .type _ada_c26006a,@function
_ada_c26006a: # @_ada_c26006a
        .cfi_startproc
# %bb.0: # %entry
        pushq %rbx
        .cfi_def_cfa_offset 16
        subq $32, %rsp
        .cfi_def_cfa_offset 48
        .cfi_offset %rbx, -16
        movw $8257, 16(%rsp) # imm = 0x2041
        movb $49, 18(%rsp)
        movw $8257, (%rsp) # imm = 0x2041
        movb $50, 2(%rsp)
        xorl %ebx, %ebx
        jmp .LBB0_1
        .p2align 4, 0x90
.LBB0_3: # %loop.cond.iter
                                        # in Loop: Header=BB0_1 Depth=1
        incb %bl
.LBB0_1: # %loop.cond
                                        # =>This Inner Loop Header: Depth=1
        movb %bl, 17(%rsp)
        movb %bl, 1(%rsp)
        movzwl (%rsp), %eax
        xorw 16(%rsp), %ax
        movzbl 2(%rsp), %ecx
        xorb 18(%rsp), %cl
        movzbl %cl, %ecx
        orw %ax, %cx
        jne .LBB0_3
# %bb.2: # in Loop: Header=BB0_1 Depth=1
        callq abort
        jmp .LBB0_3
.Lfunc_end0:
        .size _ada_c26006a, .Lfunc_end0-_ada_c26006a
        .cfi_endproc
                                        # -- End function
        .section ".note.GNU-stack","",@progbits

What version of llvm are you using? Godbolt is showing trunk and llvm 10 have a conditional branch, llvm 11 does not.

llc generates the return in all version I tried: https://godbolt.org/z/oh3fqh

I ran the IR through clang -O2 on godbolt and didn’t get a return. So I think something is happening in the middle end?

FWIW, clang 11.0 reproduces the bug: https://godbolt.org/z/GvoK63
But clang trunk doesn’t: https://godbolt.org/z/bed7eM

Correct, I missed -1 ebing equal to 255 in i8.

Michael

What version of llvm are you using? Godbolt is showing trunk and
llvm 10 have a conditional branch, llvm 11 does not.

LLVM 11.0.1.

So I looked at it again because the O3 output in IR has the
return for trunk and 11 but for the latter it's not there in assembly,
this was looking interesting: https://godbolt.org/z/EqhW4v

I think there is UB in the jump based on the memcmp because only
one byte of the 3 is initialized. To me it seems there is some type
punning going on and the two integers are used as storage for 3 bytes
instead of the 3 bytes array. I thought this may lead to the
selection of the abort branch and bad consequences. However, on
the IR level we do not recognize abort as noreturn as far as I can tell,
thus that is not it.

I took the IR 11 produced as input and run it again through 11 and trunk,
same result. Then I replaced the bcmp with `f` and the return magically
appeared for 11 too: https://godbolt.org/z/j4T4hW

We are on to something here. bcmp is recognized and we act on it. The
UB described earlier is present there too, bcmp compares 3 bytes in the two
integer struct instead of the 3 byte array.

I can manually do the optimization in the middle end that LLVM 11 does
(removed two redundant stores from the entry) but no change: https://godbolt.org/z/7efdqb
At this point I would guess the backend knows bcmp and utilizes the UB.

Whatever it is, as far as I can tell it is technically correct given the UB
in the input. However, understanding what made the difference might be worth
while nevertheless.

~ Johannes

Looks like the expand memcmp pass expanded the bcmp. Then ran InstructionSimplify on every basic block in the function because a change was made. That decided that the compare was always false. But I’m not sure it had anything to do with the bcmp expansion.

Adding Nikita. I think this might have been fixed by 835104a1141a06ae7821fe2b642b9603e00aa17b [LSR] Drop potentially invalid nowrap flags when switching to post-inc IV (PR46943)

I think there is UB in the jump based on the memcmp because only
one byte of the 3 is initialized. To me it seems there is some type
punning going on and the two integers are used as storage for 3 bytes
instead of the 3 bytes array. I thought this may lead to the
selection of the abort branch and bad consequences. However, on
the IR level we do not recognize abort as noreturn as far as I can tell,
thus that is not it.

The larger test that this was from didn't have an abort there, but a
call to a procedure that printed an error message, so the abort isn't
fundamental to the issue. Because of this, I wasn't looking at the
comparison, which should have been defined. However, I don't see the
UB here. The code does look wrong, because the pointer pun for %0 and
%1 should have been a GEP for the second part of the structure (and
actually, the allocated variable shouldn't include the two integers),
but everything is done with %0 and %1: they're initialized to three
characters, the second character is replaced by the loop variable, and
the three characters are compared.

Looks like the expand memcmp pass expanded the bcmp. Then ran
InstructionSimplify on every basic block in the function because a
change was made. That decided that the compare was always false. But
I'm not sure it had anything to do with the bcmp expansion.

But that's not the relevant compare. That compare is actually
provably false (abort is never called). The compare that's at issue
is the loop end compare.

Sorry, I wasn’t clear on which compare. I was referring to the loop end compare. After the loop strength reduce pass in the llc codegen pipeline we have this IR.

%c.0 = phi i8 [ 0, %entry ], [ %next.loop.var, %loop.cond.iter ]

%next.loop.var = add nuw i8 %c.0, 1
%loop.iter.cond = icmp eq i8 %next.loop.var, 0

Then the memcmp expand pass runs and expands the bcmp. Because this made a change to the IR, it runs InstructionSimplify on every basic block in the function. Including basic blocks that didn’t contain the bcmp. InstructionSimplify notices that loop end icmp uses an add nuw and phi that started at 0. Since the nuw says it can never get back to zero, the icmp is replaced with false.

Had the bcmp not been there, this late run of InstructionSimplify wouldn’t have happened. There are no other optimizations that always run in this part of the pipeline that would see the bad IR created from loop strength reduce and optimize based on it. SelectionDAG can do optimizations with nuw, but it only runs on a single basic block so it won’t see the phi.

Nikita’s patch, 835104a1141a06ae7821fe2b642b9603e00aa17b removes the nuw from the add on trunk. I didn’t look at why llvm 10 works.

Had the bcmp not been there, this late run of InstructionSimplify
wouldn't have happened. There are no other optimizations that always
run in this part of the pipeline that would see the bad IR created
from loop strength reduce and optimize based on it. SelectionDAG can
do optimizations with nuw, but it only runs on a single basic block
so it won't see the phi.

OK, I understand now.

Nikita's patch, 835104a1141a06ae7821fe2b642b9603e00aa17b removes the
nuw from the add on trunk.

That sounds right given that it moved the add before the branch.

I didn't look at why llvm 10 works.

Probably because it didn't do the late run or it didn't spot the phi
would be my guess.

I guess, Craig solved the mystery, nice :slight_smile:

Had the bcmp not been there, this late run of InstructionSimplify
wouldn't have happened. There are no other optimizations that always
run in this part of the pipeline that would see the bad IR created
from loop strength reduce and optimize based on it. SelectionDAG can
do optimizations with nuw, but it only runs on a single basic block
so it won't see the phi.

OK, I understand now.

Nikita's patch, 835104a1141a06ae7821fe2b642b9603e00aa17b removes the
nuw from the add on trunk.

That sounds right given that it moved the add before the branch.

FWIW, It's not the position but the use of the addition that is the problem.
As described in the beginning, computing poison is fine but if you use it to
branch you are in trouble.

That said, don't forget the initial UB in your example, I think you
got the addresses wrong at some point and therefore write into the
integers not the 3 byte array.

~ Johannes