loop unrolling introduces conditional branch

Hi,

I want to use loop unrolling pass, however, I find that loop unrolling will introduces conditional branch at end of every “unrolled” part. For example, consider the following code

void foo( int n, int array_x[])
{
for (int i=0; i < n; i++)
array_x[i] = i;
}

Then I use this command “opt-3.5 try.bc -mem2reg -loops -loop-simplify -loop-rotate -lcssa -indvars -loop-unroll -unroll-count=3 -simplifycfg -S”, it gives me this IR:

define void @_Z3fooiPi(i32 %n, i32* %array_x) #0 {
%1 = icmp slt i32 0, %n
br i1 %1, label %.lr.ph, label %._crit_edge

.lr.ph: ; preds = %0, %7
%indvars.iv = phi i64 [ %indvars.iv.next.2, %7 ], [ 0, %0 ]
%2 = getelementptr inbounds i32* %array_x, i64 %indvars.iv
%3 = trunc i64 %indvars.iv to i32
store i32 %3, i32* %2
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp ne i32 %lftr.wideiv, %n
br i1 %exitcond, label %4, label %._crit_edge

._crit_edge: ; preds = %.lr.ph, %4, %7, %0
ret void

; :4 ; preds = %.lr.ph
%5 = getelementptr inbounds i32* %array_x, i64 %indvars.iv.next
%6 = trunc i64 %indvars.iv.next to i32
store i32 %6, i32* %5
%indvars.iv.next.1 = add nuw nsw i64 %indvars.iv.next, 1
%lftr.wideiv.1 = trunc i64 %indvars.iv.next.1 to i32
%exitcond.1 = icmp ne i32 %lftr.wideiv.1, %n
br i1 %exitcond.1, label %7, label %._crit_edge

; :7 ; preds = %4
%8 = getelementptr inbounds i32* %array_x, i64 %indvars.iv.next.1
%9 = trunc i64 %indvars.iv.next.1 to i32
store i32 %9, i32* %8
%indvars.iv.next.2 = add nuw nsw i64 %indvars.iv.next.1, 1
%lftr.wideiv.2 = trunc i64 %indvars.iv.next.2 to i32
%exitcond.2 = icmp ne i32 %lftr.wideiv.2, %n
br i1 %exitcond.2, label %.lr.ph, label %._crit_edge
}

As you can see, at the end of BB 4 and BB7 there are “add”, “icmp” and “br” instrcutions to check the boundary. I understand this is for the correctness. However, I would expect the loop unrolling can change my code to something like this:

void foo( int n, int array_x[])
{
int j = n%3;
int m = n - j;
for (int i=0; i < m; i+=3){
array_x[i] = i;
array_x[i+1] = i+1;
array_x[i+2] = i+2;
}
for(i=m; i<n; i++)
array_x[i] = i;
}

In this case, the BB4 and BB7 will do not have the “add”, “icmp” and “br” instructions because these BBs can be merged together.

How can I achieve this? Thanks.

Regards,

Xiangyang

One - rather heavy weight - way to do this would be to add the -irce pass after the loop unroll step. InductiveRangeCheckElimination will introduce a post loop so as to eliminate the range checks in the inner loop. This might not be the ideal transformation for this code, but it might get you closer to what you want. A couple of caveats: - 3.5 isn’t recent enough to have a stable IRCE. Download ToT. - IRCE requires profiling information on the branches. I’d start by manually annotating your IR to see if it works, then exploring a profile build if it does. For the record, teaching the unroller to do this transformation (or a creating a new pass) would seem interesting. You might check with Chandler and/or Michael (see recent review threads) for what their plans in this area are.

Hi Xiangyang,

The algorithm for loop unrolling was changed post-3.5 to do more what you’d expect. If you use 3.6 or 3.7 you’ll likely get better results.

Cheers,

James

Hi, James and Philip, Thanks for your help.

Based on your advice, I downloaded llvm-3.7. However, with this new version of LLVM, I have the following errors when I compile my previous code:

g++ -o parser main.o llvm-config --libs all llvm-config --ldflags --system-libs -lpthread -ldl -rdynamic -ltinfo
main.o:(.data.rel.ro._ZTIN4llvm17GetElementPtrInstE[_ZTIN4llvm17GetElementPtrInstE]+0x10): undefined reference to typeinfo for llvm::Instruction' main.o:(.data.rel.ro._ZTIN4llvm8ICmpInstE[_ZTIN4llvm8ICmpInstE]+0x10): undefined reference to typeinfo for llvm::CmpInst’

BTW, in my code, I use LLVM API (IRBuilder and so on) to generate one Module and then use PassManager to add several passes. And my Makefile is pretty simple, it looks like this:

There’s been some recent noise on the mailing list about requiring -fno-rtti;
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089010.html

Could that be it?

Hi, Jeremy,

Thanks for your reply. I tried -fno-rtti yesterday and no luck.

Regards,

Xiangyang

Hi,

I just tried llvm-3.8 (LLVM SVN Repository). With this version, -fno-rtti can help me to compile my code and -irce can help me to do a better job for loop unrolling. However, I still have one question. If I use Clang to compile a piece of c++ code to .bc and then use ‘opt -loop-rotate -loop-unroll -irce’, I can get what I want. I mean, there is no conditional branch at the end of each unrolled part. However, If I use LLVM API such as IRBuilder (CreateAdd, CreateGEP, CreateLoad and so on) to generate the .bc (I dump the two .bc files and they looks like almost same except the variable name), then 'opt -loop-rotate -loop-unroll -irce’I cannot get what I want. I mean, in this case, there is still loop boundary checking (add, compare, conditional branch) at the end of each unrolled part.

I’m really confused about this. Does Clang do something special? Or do I need to do something else to eliminate the unnecessary loop boundary checking at the end of each unrolled part?

Thanks for your help.

Xiangyang

Can you post the two IR online?

Hi, Mehdi,

For example, I have this very simple source code:
void foo( int n, int array_x)
{
    for (int i=0; i < n; i++)
   array_x[i] = i;
}

After I use "clang -emit-llvm -o bc_from_clang.bc -c try.cc", I get
bc_from_clang.bc. With my code (using LLVM IRbuilder API), I get
bc_from_api.bc. Attachment please find thse two files. I also past the IR
here.
******************************** Clang Generate IR Start

bc_from_clang.bc (1.23 KB)

bc_from_api.bc (788 Bytes)

Yes, use an online service like pastebin :slight_smile:

You don’t have defined the DataLayout in the API cases, it should help to do so.

Thanks for your point that out. I just add DataLayout in my code such as “mod->setDataLayout(“e-m:e-i64:64-f80:128-n8:16:32:64-S128”);”, still no luck.

I’m really confused about this. Do I need to add more passes before -loop-unroll?

Actually, My code to generate the IR is really simple. I attach the code here for reference.

I appreciate it if you can give me any suggestion about this loop-unrolling thing.

Regards,

Xiangyang

main.cpp (6.92 KB)

If you do -print-after-all you can see after which pass the IRs start to diverge.

There is something fishy, can you try adding copy/pasting the datalayout in the .ll file?
When I did it I ended up with the same result for the two IR with opt.

Hi Xiangyang,

I also see that the function attributes differ - I think that’s the reason. When I added the attributes and target data-layout to the abi-file, both of them yielded me the same output (minus some blocks and variable names).

Thanks,
Michael