I couldn't find a specific XOR (OR and AND) optimization on llvm, and
therefore I am about to implement it.
But first I would like to check with you guys that it really does not exist.
For a simple loop like this:
nbits = 128;
bit_addr = 0;
bindex=bit_addr>>5; /* Index is number /32 */
bitnumb=bit_addr % 32; /* Bit number in longword */
The -O3 set of optimizations generates a code like this:
br label %while.body
while.body: ; preds = %while.body,
%0 = phi i32 [ 0, %entry ], [ %inc.3, %while.body ]
%shr = lshr i32 %0, 5
%rem = and i32 %0, 28
%shl = shl i32 1, %rem
%arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr
%tmp6 = load i32* %arrayidx, align 4
%xor = xor i32 %tmp6, %shl
%rem.1 = or i32 %rem, 1
%shl.1 = shl i32 1, %rem.1
%xor.1 = xor i32 %xor, %shl.1
%rem.2 = or i32 %rem, 2
%shl.2 = shl i32 1, %rem.2
%xor.2 = xor i32 %xor.1, %shl.2
%rem.3 = or i32 %rem, 3
%shl.3 = shl i32 1, %rem.3
%xor.3 = xor i32 %xor.2, %shl.3
store i32 %xor.3, i32* %arrayidx, align 4
%inc.3 = add i32 %0, 4
%exitcond.3 = icmp eq i32 %inc.3, 128
br i1 %exitcond.3, label %while.end, label %while.body
while.end: ; preds = %while.body
It is clear that we are able to fold all XORs into a single XOR, and the
same happens to all SHLs and ORs.
I am using -O3, but the code is not optimized, so I am assuming there is no
optimization for this case. Am I correct?
The loop is being unrolled by a factor of 4. This breaks the artificial
dependence between loop iterations, and should yield a substantial improvement
on machines with enough registers and functional units to take advantage of it.
The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are not
folded into 1.
If yes, I have a few other questions:
- Do you know of any other similar optimization that could do something
here but is not being triggered for some reason??
Try compiling with -Os instead of -O3. I'm guessing this will turn off loop
- Do you know why a OR instruction is used for increments? instead of using
a INC or ADD?
OR is a simpler operation than ADD at the hardware level, and
usually faster and more energy-efficient, so it should be preferred when the compiler
can statically infer that its usage is correct.
- Is there a straight forward way to know if an instruction belongs to a
loop? (just curiosity)
I'll defer to others on this one.
Thanks very much