Modifying LoopUnrollingPass

Hi Zhoulai,

I am trying to modify "LoopUnrollPass" in llvm which produces multiple
copies of loop equal to the loop unroll factor.Currently, using multicore
architecture, say 3 for example and the execution goes like:

for 3 cores if there are 9 iterations of loop
core instruction
1 0,3,6
2 1,4,7
3 2,5,8

But I want to modify it so that it executes in the following way:

core    iterations
1       0,1,2
2       3,4,5
3       6,7,8
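
In loop terms, I want to go from a cyclic schedule to a blocked
schedule, roughly like this (a hand-written C sketch of the intent;
NUM_CORES, body() and the function names are just placeholders, not
anything from the pass):

#define NUM_CORES 3

void body(int i);  /* placeholder for the loop body */

/* cyclic: core k executes i = k, k + NUM_CORES, ... (0,3,6 / 1,4,7 / 2,5,8) */
void run_cyclic(int core_id, int N) {
    for (int i = core_id; i < N; i += NUM_CORES)
        body(i);
}

/* blocked: core k executes one contiguous chunk (0,1,2 / 3,4,5 / 6,7,8) */
void run_blocked(int core_id, int N) {
    int chunk = (N + NUM_CORES - 1) / NUM_CORES;
    int lo = core_id * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;
    for (int i = lo; i < hi; ++i)
        body(i);
}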

I am not able to figure out where to make this modification. I tried
creating a sample pass from the original LoopUnrollPass code, but when I
ran "make" I received the following error:

loopunrollp.cpp:210:1: error: ‘void
llvm::initializeLoopUnrollpPass(llvm::PassRegistry&)’ should have been
declared inside ‘llvm’
/bin/rm: cannot remove
`/home/yaduveer/RP/LLVM/llvm/lib/Transforms/loopunrollp/Debug+Asserts/loopunrollp.d.tmp':
No such file or directory

Please help

Thanks,
Yaduveer

Hi Yaduveer,

As far as I remember, the unroller in the LoopVectorizer pass does what you want to achieve (look for the message "LV: Trying to at least unroll the loops." to locate this in the code).

Michael

Hi Michael,

Thank you very much!
I will try this.

Hi Yaduveer,

The vectorizer probably fails because it expects the loop in a certain form, and to convert a loop to this form one needs to run some other passes first. For example, when you run "opt -O3", the following passes are invoked:
-targetlibinfo -tti -no-aa -tbaa -scoped-noalias -assumption-cache-tracker -basicaa -ipsccp -globalopt -deadargelim -domtree -instcombine -simplifycfg -basiccg -prune-eh -inline-cost -inline -functionattrs -argpromotion -sroa -domtree -early-cse -lazy-value-info -jump-threading -correlated-propagation -simplifycfg -domtree -instcombine -tailcallelim -simplifycfg -reassociate -domtree -loops -loop-simplify -lcssa -loop-rotate -licm -loop-unswitch -instcombine -scalar-evolution -loop-simplify -lcssa -indvars -loop-idiom -loop-deletion -loop-unroll -memdep -mldst-motion -domtree -memdep -gvn -memdep -memcpyopt -sccp -domtree -bdce -instcombine -lazy-value-info -jump-threading -correlated-propagation -domtree -memdep -dse -loops -loop-simplify -lcssa -licm -adce -simplifycfg -domtree -instcombine -barrier -float2int -domtree -loops -loop-simplify -lcssa -loop-rotate -branch-prob -block-freq -scalar-evolution -loop-accesses -loop-vectorize -instcombine -scalar-evolution -slp-vectorizer -simplifycfg -domtree -instcombine -loops -loop-simplify -lcssa -scalar-evolution -loop-unroll -instsimplify -loop-simplify -lcssa -licm -scalar-evolution -alignment-from-assumptions -strip-dead-prototypes -globaldce -constmerge -verify

To get this list, you can use the following command:
llvm-as < /dev/null | opt -O3 -disable-output -debug-pass=Arguments

Once you have the list of passes that run before the vectorizer, you need to get 'unoptimized' IR and run those passes on it - that should give you the IR just before the vectorizer.

To get the unoptimized IR, you could use
clang -O3 -mllvm -disable-llvm-optzns -emit-llvm your_source.c -S -o unoptimized_ir.ll
(Please note that we use “-O3 -mllvm -disable-llvm-optzns”, not just “-O0” - that allows us to run analysis passes, but not transformations)

Now you run ‘opt’ with passes preceding the vectorizer to get IR before vectorization:
opt -targetlibinfo -tti -no-aa -tbaa …… -scalar-evolution -loop-accesses unoptimized_ir.ll -S -o ir_before_loop_vectorize.ll
(you might want to remove verifier passes from the list)

And after this you are ready to run the vectorizer:
opt -loop-vectorize ir_before_loop_vectorize.ll -S -o ir_after_loop_vectorize.ll

Hopefully, that’ll resolve the issues you are facing.

Thanks,
Michael

Optimization passes running before the LoopVectorizer should be able to combine the two statements (this should be happening at -O1, please check):

arr[i] = a + i;
sum += arr[i];

to

sum += a + i;
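
In other words, if the loop is something like this (my reconstruction of
your example; n, the types, and the function name are guesses):

int sum_with_array(int a, int n, int *arr) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        arr[i] = a + i;   /* store */
        sum += arr[i];    /* load of the value just stored */
    }
    return sum;
}

then store-to-load forwarding (e.g. in GVN) should reduce the body to
sum += a + i, leaving the array stores dead unless arr is used later.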

I'm not sure why you are using the array there.

- Suyog

Hi Yaduveer,

I may be missing something, but it seems you're trying to get
different cores running parts of the loop, which you'll get for free
if you use OpenMP.

The loop unroller is meant to increase load/store speed by loading a
lot of values, then operating on all of them, then writing them all
back together. Even when the result is not vectorized, unrolling still
gives some performance gains. Vectorization, in turn, is only about the
SIMD engine in a single core (doing 2/4/8 operations at the same time)
and has nothing to do with using multiple cores.
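
For example, a 4x unroll turns a simple accumulation loop into roughly
this (illustrative C, not actual pass output):

int sum_unrolled(const int *a, int n) {
    int i, sum = 0;
    for (i = 0; i + 3 < n; i += 4) {  /* main loop, 4 iterations per step */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; ++i)                /* remainder loop for leftovers */
        sum += a[i];
    return sum;
}

Note that every iteration still runs on the same core, just in bigger steps.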

Before you jump head first into the source, you need to ask yourself
the right question: What do you want to do?

1) Use all cores, dividing the loop into multiple cores, one block at
a time. Use OpenMP for this (see the example after this list).
2) Use your SIMD engine on each core. Use the loop vectorizer for this.
3) Or is it just about load/store speed ups? The loop unroller will
help you here.
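
To illustrate option 1, the example below is all it takes (standard
OpenMP, nothing LLVM-specific; printf stands in for your loop body):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* schedule(static) gives each thread one contiguous block of
       iterations - the 0,1,2 / 3,4,5 / 6,7,8 layout you asked about */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 9; ++i)
        printf("thread %d runs iteration %d\n", omp_get_thread_num(), i);
    return 0;
}

Build with an OpenMP-enabled compiler (e.g. with -fopenmp).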

You can also use all three at the same time, having all cores running
their SIMD engines with a massively unrolled loop by using all of the
above.

cheers,
--renato

Smells like a reduced-too-far example. :)

cheers,
--renato

Hi All,

Thank you for your suggestions.

@Michael: I tried your suggestion. I could run “Loop Vectorizer” successfully.

But my problem is not solved.

@Renato:

I want to do the first option:

  1. Use all cores, dividing the loop into multiple cores, one block at
    a time. But I don't want to use OpenMP for this.

I don't want the user to write OpenMP pragmas in the program; instead, I want to implement the OpenMP logic in a new pass along with loop unrolling. Could you please suggest how I can proceed with this? Is it possible to do so?

> 1. Use all cores, dividing the loop into multiple cores, one block at
> a time. But I don't want to use OpenMP for this.

Right, that is *exactly* what OpenMP does; I'm not sure why you don't
want to use it.

Just unrolling the loops will not get you multi-threaded behaviour,
nor will vectorizing them. You need a thread library, and OpenMP
provides one. You could also use pthreads or MPI, but not without
changing the source code a lot more than with OpenMP, and with the
same amount of work to get the libraries working.

> I don't want the user to write OpenMP pragmas in the program; instead, I
> want to implement the OpenMP logic in a new pass along with loop unrolling.

This doesn't make sense. Implementing OpenMP logic in the loop
unroller to avoid users knowing about OpenMP will never be accepted
upstream, and honestly, it's the wrong place for doing this, as you
*will* need run-time libraries as well as heavy IR transformations
that are already implemented.

> Could you please suggest how I can proceed with this? Is it possible
> to do so?

Why not hide OpenMP command line options and pragmas from the user by
doing source-to-source transformation in Clang?

First, get Clang's AST for a loop with and without OpenMP information,
then detect the loops you want to split and add the metadata to the
AST before lowering to IR.
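
For example, comparing the dumps of these two functions will show you
the extra nodes you need to synthesize (AST node names from memory -
double-check them against your Clang version):

/* plain loop: shows up in the AST as a ForStmt */
void plain(int n, int *a) {
    for (int i = 0; i < n; ++i)
        a[i] = i;
}

/* OpenMP loop: wrapped in an OMPParallelForDirective,
   with a CapturedStmt around the body */
void parallel(int n, int *a) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = i;
}

(Remember to pass -fopenmp when dumping, otherwise the pragma is ignored.)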

This way, the whole work will be done by the already existing OpenMP
implementation and run-time libraries, and the validation will be done
by the compiler in the same way, so unless you use SIMD pragmas in the
wrong way, you should be ok. No user will need to know about OpenMP
when they use your front-end wrapper.

cheers,
--renato

Hi Renato,

Thanks for the help.

I am trying to follow the AST approach. I tried inspecting the AST contents using the following command:

clang -Xclang -ast-dump -fsyntax-only loop.c

This is giving me some AST output (I believe so), but I am having two issues:

  1. I am not able to put this output in a file, as it's showing the following error:

yaduveer@yaduveer-Inspiron-3542:~/RP$ clang -Xclang -ast-dump -fsyntax-only loop1d.c | llvm-dis -o ast.txt
llvm-dis: Invalid bitcode signature
clang: error: unable to execute command: Broken pipe
clang: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 3.6.0 (trunk 225627) (llvm/trunk 225626)
Target: x86_64-unknown-linux-gnu
Thread model: posix
clang: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang: note: diagnostic msg:

llvm-dis works on LLVM bitcode and not on clang ASTs.
You are mixing frontend and LLVM IR.
Please see http://llvm.org/docs/CommandGuide/llvm-dis.html for more info.
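
If you just want the dump in a file, redirect clang's output instead of
piping it into llvm-dis, e.g.:

clang -Xclang -ast-dump -fsyntax-only loop1d.c > ast.txt 2>&1

(the AST dump is plain text, not bitcode).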

Regards,
Suyog

ASTs are normally multiple times larger than source code, as they
contain a lot of semantics and cross references between the
statements.

I recommend you start a new thread at cfe-dev@cs.uiuc.edu (the Clang
mailing list) to learn how to detect the AST patterns and add your
undercover OpenMP pragmas to loops - and, more importantly, how to do
that as an out-of-tree patch, or even as a Clang extra tool.

They'll be able to help you a lot more than this crowd. :)

cheers,
--renato

Hi Renato,

I have started a new thread regarding the AST as per your suggestion. I have one more query: could you please suggest a command to see a "dependency analysis" of the "basic blocks" in the IR?

I mean something by which I can understand how basic blocks are connected.

Like this?

http://icsweb.inf.unisi.ch/cms/images/stories/ICS/slides/llvm-graphs.pdf
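
You can also have opt emit the CFG as Graphviz files:

opt -dot-cfg -disable-output ir_before_loop_vectorize.ll

which writes a cfg.<function>.dot file per function (and -view-cfg
opens it directly if you have Graphviz installed).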

cheers,
--renato