code generation removes duplicated instructions

Hello,

I am duplicating few instructions in a basic block and splitting it. The following is an example.

bb: ; preds = %bb1
%0 = load i32* %i, align 4
%1 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
store i32 0, i32* %1, align 4
%2 = load i32* %i, align 4
%3 = getelementptr inbounds [100 x i32]* %last_added, i32 0, i32 %2
store i32 -1, i32* %3, align 4
%4 = load i32* %i, align 4
%5 = add nsw i32 %4, 1
store i32 %5, i32* %i, align 4
br label %bb1

==>

%0 = load i32* %i, align 4
%HV14_ = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%1 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%HVCmp7 = icmp ne i32* %1, %HV14_
br i1 %HVCmp7, label %relExit, label %bb.split

So that HV14_ is a new instruction and I am inserting a comparison to jump to a newly created basic block. Somehow the code generation for arm removes the duplicated instruction and cmp instruction in arm assembly looks as follows.

cmp r0, r0

This defeats the purpose of doing the duplication in the first place. Does anyone have any insight on this? Can anyone suggest some starting points to debug this?

Thanks a lot.
Daya

%0 = load i32* %i, align 4
%HV14_ = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%1 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%HVCmp7 = icmp ne i32* %1, %HV14_
br i1 %HVCmp7, label %relExit, label %bb.split

So that HV14_ is a new instruction and I am inserting a comparison to jump
to a newly created basic block. Somehow the code generation for arm removes
the duplicated instruction and cmp instruction in arm assembly looks as
follows.
cmp r0, r0

Hi Daya,

This is perfectly legal, since the two registers have exactly the same
value (a[i]) and the comparison will always be the same.

I suppose the rest of the original IR (the other array load, the
stores and the increment) are in the branched basic blocks...

This defeats the purpose of doing the duplication in the first place. Does
anyone have any insight on this? Can anyone suggest some starting points to
debug this?

What is the purpose of the duplication?

cheers,
--renato

Thank you for replying. Yes. The remaining part of the BB is in splitted basic block.

The following is an example code generation for arm and x86 for a same IR BB. In the x86 code I can see that the same computation is done twice and result is stored in two different registers and then these two different registers are used for comparision. By the way I am duplicating instruction and inserting comparison to catch transient errors.

IR BB:

bb: ; preds = %bb1.split
%0 = load i32* %i, align 4
%HV10_ = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%1 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %0
%HVCmp15 = icmp ne i32* %1, %HV10_
br i1 %HVCmp15, label %relExit, label %bb.split

x86 asm:

.LBB0_1: # %bb

in Loop: Header=BB0_5 Depth=1

leal 972(%esp), %eax
movl 568(%esp), %ecx
imull $4, %ecx, %edx
addl %eax, %edx
imull $4, %ecx, %ecx
addl %eax, %ecx
cmpl %edx, %ecx
movl %ecx, 508(%esp) # 4-byte Spill
jne .LBB0_88

arm asm:

.LBB0_1: @ %bb
@ in Loop: Header=BB0_5 Depth=1
ldr r0, [sp, #444]
add r1, sp, #53, 28 @ 848
add r0, r1, r0, lsl #2
cmp r0, r0
str r0, [sp, #384]
bne .LBB0_88
b .LBB0_2

Thanks
Daya

The following is an example code generation for arm and x86 for a same IR
BB. In the x86 code I can see that the same computation is done twice and
result is stored in two different registers and then these two different
registers are used for comparision.

Yes, but you shouldn't rely on it, since the compiler is free to
optimize that away.

By the way I am duplicating instruction
and inserting comparison to catch transient errors.

I thought so. Try running llc with -O0 or disable specific
optimizations (like dead-code elimination) to keep your comparisons
intact.

But since the two values are identical, it's possible that even so,
both will live in the same register.

cheers,
--renato

Hello,

The code snippet pasted in the previous email are generated at -O0 with llc.

Since I am inserting a new basic block (contains printf statement and program exit) which is jumped upon based on the result of the comparison, the compiler cannot/shouldnot optimize that away by means of DCE or anything else.

The same kind of stuff is happening for the following duplication.

bb6.split: ; preds = %bb6
%33 = load i32* %32, align 4
%34 = load i32* %i, align 4
%HV4_3 = sub nsw i32 %34, %33
%35 = sub nsw i32 %34, %33
%HV4_2 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %HV4_3
%36 = getelementptr inbounds [100 x i32]* %a, i32 0, i32 %35
%LDCmp6 = icmp ne i32* %36, %HV4_2
br i1 %LDCmp6, label %relExit, label %bb6.split.split

In the above example even the operands are not same and I guess compiler cannot be that smart at -O0. I sense something is wrong with the code generation for ARM.

What other way do you suggest for duplicating since you mentioned I shouldn’t rely on duplication the way I am doing it?

Thanks a lot. I really appreciate your help.
Daya

Since I am inserting a new basic block (contains printf statement and
program exit) which is jumped upon based on the result of
the comparison, the compiler cannot/shouldnot optimize that away by means of
DCE or anything else.

It most certainly can, since the comparison yields always the same
result. The compiler can replace all that by a simple branch to
whatever block always executes.

In the above example even the operands are not same and I guess compiler
cannot be that smart at -O0. I sense something is wrong with the code
generation for ARM.

There're no hard rules stopping any compiler to run DCE or DAG
combining/elimination or whatever when in O0 or any other level.

What other way do you suggest for duplicating since you mentioned I
shouldn't rely on duplication the way I am doing it?
Thanks a lot. I really appreciate your help.

What you need is a way to make sure it won't be optimised away,
possibly even on higher O-levels. You need to add a dependency that
the compiler cannot see through (or is not smart enough at O0 to do
so, at least). Because this is a transient change, as a way to debug
other problems, you're allowed to do ugly stuff.

For example, you can get a fake offset from an argument, and guarantee
that this is always zero. Or you can add a random offset internally to
one of them, and then compare that one is exactly the offset higher
(or lower) than the other. Or you could pass %a and %b as parameters
and memcopy them in the caller.

Anything that would make it difficult for the compiler to see that
both are the same would do...

cheers,
--renato

Hi Renato,

I am trying to add a intrinsic call between the similar two instructions which either I’ll remove or convert to nop in codegen.

Does that kind of seem appropriate for the purpose here?

Thanks
Daya

I am trying to add a intrinsic call between the similar two instructions
which either I'll remove or convert to nop in codegen.

If the two instructions are only similar in your real example, than
you need to make them similar in your test, not identical. Different
offsets, different array...

If them two are identical in the real example as it is in your test,
than you don't need to worry about it, because the back-end will
remove them anyway.

Does that kind of seem appropriate for the purpose here?

If you're adding the builtin call just to mark the code in some way, I
suggest you using metadata. If that builtin has some function that is
necessary in between two similar instructions, than it looks ok.

Sorry if I'm being vague, I'm not fully getting the original problem...

cheers,
--renato

Ok. Let me describe the problem again in some detail.

The following is the original bitcode from a real testcase:

bb7:
%46 = load i32* %j, align 4
%47 = add nsw i32 %46, 1
store i32 %47, i32* %j, align 4
br label %bb8

To protect the operand of the store I duplicate the input chain of operands and insert a comparison to check whether the operand of the stores are correct. As a result of this modification the code looks as follows. Here instructions with HV in there name are extra inserted instructions and relExit BB contains a printf message which is executed if the comparison fails. This is the final code which is supposed to execute on a simulator or real hardware.

bb7:
%46 = load i32* %j, align 4
%47 = add nsw i32 %46, 1
%HV7_ = add nsw i32 %46, 1
%HVCmp13 = icmp ne i32 %47, %HV7_
br i1 %HVCmp13, label %relExit, label %bb7.split

bb7.split:
store i32 %47, i32* %j, align 4
br label %bb8

The following is the arm code generated for the above block (llc with -O0):

.LBB0_26: @ %bb7
@ in Loop: Header=BB0_28 Depth=2
ldr r0, [sp, #440]
add r0, r0, #1
cmp r0, r0
str r0, [sp, #288]
bne .LBB0_88
b .LBB0_27
.LBB0_27: @ %bb7.split
@ in Loop: Header=BB0_28 Depth=2
ldr r0, [sp, #288]
str r0, [sp, #440]

The code generated for x86 for the same two blocks looks like as follows (again llc with -O0):

.LBB0_26: # %bb7

in Loop: Header=BB0_28 Depth=2

movl 564(%esp), %eax
movl %eax, 400(%esp) # 4-byte Spill
addl $1, %eax
movl 400(%esp), %ecx # 4-byte Reload
addl $1, %ecx
cmpl %ecx, %eax
movl %eax, 396(%esp) # 4-byte Spill
jne .LBB0_88

BB#27: # %bb7.split

in Loop: Header=BB0_28 Depth=2

movl 396(%esp), %eax # 4-byte Reload
movl %eax, 564(%esp)

The problem here is in case of ARM code generation the backend has effectively removed the comparison (its comparing same register) of different computations. I want to have that comparison in the final code. The final code would be running on a hardware or simulator so that whenever a transient error occurs I can detect that and trigger a recovery. My goal here is to achieve the duplicated computation (same as visible in x86 asm e.g. adding the one twice) for ARM. Is there a way I can achieve it?

Does the problem make more sense now? Thank you for your help.

Thanks
Daya

You can hide the fact that the base values are the same from CodeGen
by writing the IR equivalent to something like the following C:
char* newval = val
asm("" : +r(newval));
if (&newval[3] != &val[3]) // Looks like a non-trivial comparison

-Eli