XOR Optimization

Hi Daniel,

Hi folks,

I couldn't find a specific XOR (or OR and AND) optimization in LLVM, and
therefore I am about to implement it.
But first I would like to check with you guys that it really does not exist.

For a simple loop like this:

nbits = 128;
bit_addr = 0;
while (nbits--)
{
   bindex = bit_addr >> 5;   /* Index is number / 32 */
   bitnumb = bit_addr % 32;  /* Bit number in longword */
   bitmap[bindex] ^= (1L << bitnumb);
   bit_addr++;
}

The -O3 set of optimizations generates code like this:

entry:
   br label %while.body

while.body: ; preds = %while.body, %entry
   %0 = phi i32 [ 0, %entry ], [ %inc.3, %while.body ]
   %shr = lshr i32 %0, 5
   %rem = and i32 %0, 28
   %shl = shl i32 1, %rem
   %arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr
   %tmp6 = load i32* %arrayidx, align 4
   %xor = xor i32 %tmp6, %shl
   %rem.1 = or i32 %rem, 1
   %shl.1 = shl i32 1, %rem.1
   %xor.1 = xor i32 %xor, %shl.1
   %rem.2 = or i32 %rem, 2
   %shl.2 = shl i32 1, %rem.2
   %xor.2 = xor i32 %xor.1, %shl.2
   %rem.3 = or i32 %rem, 3
   %shl.3 = shl i32 1, %rem.3
   %xor.3 = xor i32 %xor.2, %shl.3
   store i32 %xor.3, i32* %arrayidx, align 4
   %inc.3 = add i32 %0, 4
   %exitcond.3 = icmp eq i32 %inc.3, 128
   br i1 %exitcond.3, label %while.end, label %while.body

while.end: ; preds = %while.body
   ret void

It is clear that we should be able to fold all the XORs into a single XOR, and
the same goes for all the SHLs and ORs.
I am using -O3, but the code is not optimized this way, so I am assuming there
is no optimization for this case. Am I correct?

The loop is being unrolled by a factor of 4. This breaks the artificial
dependence between loop iterations, and should yield a substantial improvement
on machines with enough registers and functional units to take advantage of it.
The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are not
folded into 1.

If yes, I have a few other questions:

  - Do you know of any other similar optimization that could do something
here but is not being triggered for some reason?

Try compiling with -Os instead of -O3. I'm guessing this will turn off loop
unrolling.

  - Do you know why an OR instruction is used for the increments, instead of
an INC or ADD?

OR is a simpler operation than ADD at the hardware level, and
usually faster and more energy-efficient, so it should be preferred when the compiler
can statically infer that its usage is correct.

  - Is there a straight forward way to know if an instruction belongs to a
loop? (just curiosity)

I'll defer to others on this one.

Thanks very much

Daniel Nicacio

Best,
Matt

Hi,

"The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are
not folded into 1."

I think he is trying to say this expression generated by unrolling by a factor of 4 can indeed be folded into a single XOR, SHL and OR.


"OR is a simpler operation than ADD at the hardware level, and usually faster
and more energy-efficient, so it should be preferred when the compiler can
statically infer that its usage is correct."

I haven't seen a machine in which OR is faster than ADD or more energy-efficient. They're all done by the same ALU circuitry, which delays the pipeline by its worst-case path timing. So, for modern processor hardware purposes, OR is exactly equal to ADD. Transforming ADD to OR isn't strength reduction at all. Maybe this is beneficial only if you have a backend generating circuitry (programming FPGAs).

Hi-


I believe that in cases where ADD and OR are equivalent, LLVM prefers the latter because it's easier to reason about the bits in the result of an OR in complex cases. The x86 backend, for instance, transforms ORs in such cases back into adds, presumably in case it may be matched to an lea where that's beneficial.

Alistair

“The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are not folded into 1.”

I don't see why the unrolling explains it.

"I think he is trying to say this expression generated by unrolling by a factor of 4 can indeed be folded into a single XOR, SHL and OR."

Precisely. The code generated by unrolling can be folded into a single XOR and SHL. And even if it were not inside a loop, it could still be optimized. What I want to know is: is there an optimization that is supposed to handle this code but for some reason thinks it is not possible, or is there no optimization for that situation at all?

Thanks for the help, guys

Hi Daniel,


it could be a phase ordering problem. If you run "opt -std-compile-opts" on the
unsatisfactory bitcode, does it clean up all the bit fiddling?

Ciao, Duncan.

Hi Duncan,

When I run "opt -std-compile-opts" on the original source code, it has the same output as -O3.
When I run "opt -std-compile-opts" on the -O3 optimized code, things get even weirder; it outputs the following code:

while.body: ; preds = %while.body, %entry
   %indvar = phi i32 [ 0, %entry ], [ %indvar.next.3, %while.body ]
   %tmp = shl i32 %indvar, 2
   %0 = lshr i32 %indvar, 3
   %shr = and i32 %0, 134217727
   %rem = and i32 %tmp, 16
   %shl = shl i32 1, %rem
   %arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr
   %tmp6 = load i32* %arrayidx, align 4
   %rem.1 = or i32 %rem, 1
   %shl.1 = shl i32 1, %rem.1
   %rem.2 = or i32 %rem, 2
   %shl.2 = shl i32 1, %rem.2
   %rem.3 = or i32 %rem, 3
   %shl.3 = shl i32 1, %rem.3
   %xor = xor i32 %shl, %tmp6
   %xor.1 = xor i32 %xor, %shl.3
   %xor.2 = xor i32 %xor.1, %shl.2
   %xor.3 = xor i32 %xor.2, %shl.1
   %rem.11 = or i32 %rem, 4
   %shl.12 = shl i32 1, %rem.11
   %rem.1.1 = or i32 %rem, 5
   %shl.1.1 = shl i32 1, %rem.1.1
   %rem.2.1 = or i32 %rem, 6
   %shl.2.1 = shl i32 1, %rem.2.1
   %rem.3.1 = or i32 %rem, 7
   %shl.3.1 = shl i32 1, %rem.3.1
   %xor.13 = xor i32 %shl.12, %xor.3
   %xor.1.1 = xor i32 %xor.13, %shl.3.1
   %xor.2.1 = xor i32 %xor.1.1, %shl.2.1
   %xor.3.1 = xor i32 %xor.2.1, %shl.1.1
   %rem.24 = or i32 %rem, 8
   %shl.25 = shl i32 1, %rem.24
   %rem.1.2 = or i32 %rem, 9
   %shl.1.2 = shl i32 1, %rem.1.2
   %rem.2.2 = or i32 %rem, 10
   %shl.2.2 = shl i32 1, %rem.2.2
   %rem.3.2 = or i32 %rem, 11
   %shl.3.2 = shl i32 1, %rem.3.2
   %xor.26 = xor i32 %shl.25, %xor.3.1
   %xor.1.2 = xor i32 %xor.26, %shl.3.2
   %xor.2.2 = xor i32 %xor.1.2, %shl.2.2
   %xor.3.2 = xor i32 %xor.2.2, %shl.1.2
   %rem.37 = or i32 %rem, 12
   %shl.38 = shl i32 1, %rem.37
   %rem.1.3 = or i32 %rem, 13
   %shl.1.3 = shl i32 1, %rem.1.3
   %rem.2.3 = or i32 %rem, 14
   %shl.2.3 = shl i32 1, %rem.2.3
   %rem.3.3 = or i32 %rem, 15
   %shl.3.3 = shl i32 1, %rem.3.3
   %xor.39 = xor i32 %shl.38, %xor.3.2
   %xor.1.3 = xor i32 %xor.39, %shl.3.3
   %xor.2.3 = xor i32 %xor.1.3, %shl.2.3
   %xor.3.3 = xor i32 %xor.2.3, %shl.1.3
   store i32 %xor.3.3, i32* %arrayidx, align 4
   %indvar.next.3 = add i32 %indvar, 4
   %exitcond.3 = icmp eq i32 %indvar.next.3, 32
   br i1 %exitcond.3, label %while.end, label %while.body

while.end: ; preds = %while.body
   ret void

Thanks for the reply,

Daniel

After a few more tests, I found out that if we set -unroll-threshold to a large enough value and run "opt -std-compile-opts" or "opt -O3" 3 times, the unroller is able to unroll the original loop 32 times. Once it is unrolled at least 32 times, an optimization is triggered, folding everything into a single "%xor.3.3.1 = xor i32 %tmp6, -1" (I don't know why it does not transform that into a NOT, though).

Therefore, the optimization that I want is already there somewhere, but is not triggered when llvm unrolls the loop less than 32 times.

My goal now is to make it work for smaller unrolled loops, like for 4, 8, and 16.

Does anyone know which function gives the information that it is possible to optimize for 32 unrolled steps but not for smaller numbers?

I also would like to see why the "XOR A, -1" is not turned into a NOT; any hints on that are welcome.

Thanks,

Daniel Nicacio

2011/7/26 Daniel Nicácio <dnicacios@gmail.com>

Probably because NOT (like NEG) doesn't exist :)

<http://llvm.org/docs/LangRef.html#instref>

I assume the decision was made that it wasn't worth adding the extra
unary instructions when they can easily be handled in codegen by
matching "XOR X, -1" or "SUB 0, X".

~ Scott

Hey guys,

I still think there is no optimization doing what I want. When the loop is unrolled 32 times, LLVM is able to identify that the loop is working on a whole word; it finds some constants and propagates them, resulting in the folded XOR instruction. However, when the loop operates on only some bits of the word, LLVM is still not able to fold those XORs, even when the bits operated on do not overlap each other.

Therefore, I am implementing the following optimization for folding XOR instructions working on bits (it still must be extended to OR and AND instructions). Any comments and critiques are appreciated.

Basically, I try to identify a chain of XOR instructions and fold it. The image below illustrates this:

In order to do that, I am adding an additional function call to "visitXor()" in Instruction Combining.

My optimization function is attached as a patch file (the diff was made against the 2.9 release of LLVM).

Thanks

Daniel Nicacio

InstCombineAndOrXor.diff (635 Bytes)

InstructionCombining.diff (6.9 KB)

Hi Daniel,

I'm not the best person to review your patch, but a couple of things:

  1. Also attach a testcase as in test/Transforms/InstCombine
  2. Generate your patch from svn "TOT"
  3. Send patches directly to llvm-commits

2011/7/28 Daniel Nicácio <dnicacios@gmail.com>