LLVM and OpenMP

I am involved in a project where one of the aims is to
study the effects of different parallelization strategies
on efficiency, power consumption etc. The idea is to
do automated design space exploration by varying some
parameters (e.g. number of tasks) and measuring their effect.

Since we are already using LLVM for other purposes, we thought
about using LLVM for analysis and then OpenMP for compilation.
The idea was to use the LLVM back end to spit out C code with
OpenMP directives. However, looking at the C code that llc
produces, it seems that this might be a non-starter, as loops
have already been turned into gotos in the generated C.

I suspect we may need to do something else, but if anyone has
any bright ideas on how to use LLVM for this purpose, I'd be
very grateful.

TIA,

> I am involved in a project where one of the aims is to
> study the effects of different parallelization strategies
> on efficiency, power consumption etc. The idea is to
> do automated design space exploration by varying some
> parameters (e.g. number of tasks) and measuring their effect.
>
> Since we are already using LLVM for other purposes,

cool. Is it listed at http://llvm.org/Users.html ?

> we thought
> about using LLVM for analysis and then OpenMP for compilation.
> The idea was to use the LLVM back end to spit out C code with
> OpenMP directives. However, looking at the C code that llc
> produces, it seems that this might be a non-starter, as loops
> have already been turned into gotos in the generated C.
>
> I suspect we may need to do something else, but if anyone has
> any bright ideas on how to use LLVM for this purpose, I'd be
> very grateful.

If you're using llvm-gcc-4.0 to get llvm bitcode for analysis, then the front end ignores OpenMP pragmas. If you're trying to use llvm-gcc-4.2, then most likely it won't work, because the llvm bitcode emitter has not been updated to handle OpenMP. If you're getting llvm bitcode some other way (without going through llvm-gcc) and somehow annotating it with OpenMP instructions (I don't know how), then the CBackend, which is used by llc, won't be able to handle it at the moment either. In simple words, LLVM is not OpenMP ready yet.

The easiest route would be to update the llvm bitcode converter in llvm-gcc-4.2 so that it operates on GCC trees after the OpenMP lowering phase is executed.

If you're interested, we welcome contribution on OpenMP front.

Devang,

> The easiest route would be to update llvm bitcode converter in
> llvm-gcc-4.2 such that llvm bitcode converter operates on GCC trees after
> OpenMP lowering phase is executed.
>
> If you're interested, we welcome contribution on OpenMP front.

I have already taken a quick look at OpenMP in llvm-gcc 4.2.

Internally, OpenMP lowering in gcc is split into two phases. Funny, but the
llvm converter is run between these phases :) It looks like the second phase
uses some gcc analysis passes (dominance info, etc.).

In order to support OpenMP in llvm-gcc 4.2 we should:
1. Decide at which level we should handle OpenMP trees in llvm-gcc
2. Define a set of OpenMP builtins, which will need to be supported in
llc codegen
3. Something I'm not aware of :)

If anyone wants to work on bringing OpenMP to llvm-gcc, I'll be happy to
help.

Devang Patel wrote:

> cool. Is it listed at http://llvm.org/Users.html ?

Doesn't seem to be. I'll have to prod the guys.

> If you're using llvm-gcc-4.0 to get llvm bitcode for analysis then front-end is ignoring OpenMP pragmas. if you're trying to use llvm-gcc-4.2 then most likely it won't work, because llvm bitcode emitter has not been updated to handle OpenMP. If you're getting llvm bitcode (without going through llvm-gcc) and somehow annotating it with OpenMP instructions (I don't know how) then also the CBackend, which is used by llc, won't be able to handle it at the moment.

Thanks for the information. The vague idea was to annotate bitcode
with OpenMP instructions, or maybe to modify OpenMP instructions
present in the original source. But it seems that this would not
be possible, at least not in a straightforward fashion.

> If you're interested, we welcome contribution on OpenMP front.

We'd be happy to contribute, if something eventually comes out.

Hi,

To begin with, I'd like to say I am not very familiar with (llvm-)gcc
internals, so some of my assumptions and guesses may be wrong. Please
correct me where needed.

Anton Korobeynikov wrote:

> Internally OpenMP lowering in gcc is split into two phases. Funny, but
> llvm converter is run between these phases :) It looks like second phase
> uses some gcc analysis passes (like dominance info, etc).
>
> In order to support OpenMP in llvm-gcc 4.2 we should:
> 1. Decide at which level we should handle openmp trees in llvm-gcc
> 2. Define a set of openmp builtins, which will need to be supported in
> llc codegen
> 3. Something I'm not aware of :)

I've found an interesting paper on the design and implementation of
OpenMP in GCC: http://people.redhat.com/dnovillo/Papers/gcc2006.pdf
It describes how and when OpenMP pragmas are lowered. It seems the
second phase that Anton wrote about is the most important one. I believe
it is the "pass_expand_omp" mentioned in the paper. Its aim is to
extract parallel sections into new functions, expand directives into
built-in function calls, insert code to calculate the iteration space,
etc. I guess no OpenMP-specific GIMPLE codes are left after this pass.
If it is possible to do the conversion to LLVM IR at this stage, the
only thing left to do is to handle the OpenMP built-ins. The simplest
route would be to replace them with calls to libgomp (an LGPL library
accompanying GCC). This is what GCC seems to do.

The second option is to allow OpenMP directives to be somehow expressed
directly in LLVM IR (or some general multiprocessing directives that
OpenMP ones can be lowered to). Then, a pass could be written that
handles them in a similar fashion to "pass_expand_omp" in GCC (see
above). Any frontend (e.g. clang) would then have a simpler job
supporting OpenMP. That said, I admit I have no idea how to express
these directives. For example: how do we annotate a loop to say it
should be turned into a parallel one? I believe keeping the IR as simple
as possible is an important goal, so we would need some lightweight
mechanism, transparent to the existing passes.

I'm personally interested in the subject, because I am trying to develop
an automatic loop parallelization pass for LLVM. Initially, I thought I
would write everything from scratch. However, if there is a decision to
include some support for OpenMP (and similar techniques) lowering in the
LLVM core, I'm ready to help.

--Wojtek

Hi,

Pertti Kellomäki wrote:

> Since we are already using LLVM for other purposes, we thought
> about using LLVM for analysis and then OpenMP for compilation.
> The idea was to use the LLVM back end to spit out C code with
> OpenMP directives. However, looking at the C code that llc
> produces, it seems that this might be a non-starter, as loops
> have already been turned into gotos in the generated C.
>
> I suspect we may need to do something else, but if anyone has
> any bright ideas on how to use LLVM for this purpose, I'd be
> very grateful.

I have some ideas, but I'm not sure if they are bright... :)

As you have noticed, loops aren't represented directly in the LLVM IR.
However, there are analysis passes which may be helpful for
"reconstructing" them. For example, the LoopInfo pass detects natural
loops (as sets of basic blocks) and the ScalarEvolution pass finds loop
induction variables (and also does many other things). Using them,
together with some custom analysis, it should be possible to detect most
loops that are potential candidates for parallelization. Is this what
you would like to achieve?

If you know a loop is parallelizable, it is generally possible in LLVM
to extract it (its blocks) into a new function. In place of the original
loop, code can be inserted that spawns some threads calling this new
function, with each call getting a different iteration range for the
loop. If you're interested, I can send you a very simplified but working
example that demonstrates this (it's about 500 lines).

--Wojtek

This is what I referred to.

Right now, one big missing piece in this puzzle is dependence analysis.

Devang Patel wrote:

> Right now, one big missing piece in this puzzle is dependence
> analysis.

Right. I was only trying to say that it shouldn't be very difficult to
find those for/do loops which are interesting from the parallelization
perspective (in general, not all for/do loops can be reconstructed).

As for the dependence analysis, I need this piece for my project, and I
am preparing myself to write it. I am currently studying some papers,
but haven't yet decided on the method. Maybe some of you have experience
in the area and could advise me? In the next few days I am going to take
a look at the methods based on the chains-of-recurrences algebra. They
seem particularly interesting because chrecs are already implemented in
the scalar evolution pass. Given the limited time I have for the
project, I don't know how precise (and buggy ;)) my implementation will
be. However, if it turns out to be usable, I'll be happy to contribute.

--Wojtek

Actually, gcc expands things _and_ makes calls into libgomp,
so lowering OpenMP first would be best - otherwise you'll
need to handle the language structures.

-eric

Wojciech,

I've just committed a patch to llvm-gcc 4.2 which moves the OpenMP
lowering stuff to run a little bit earlier, so llvm-convert will catch
its result. It now looks like the gcc atomic & sync builtins are the
remaining ingredient to be introduced to llvm.

Example program from Diego's paper now compiles to:

@.str = internal constant [10 x i8] c"sum = %d\0A\00" ; <[10 x i8]*> [#uses=1]

define i32 @main() {
entry:
        %sum = alloca i32 ; <i32*> [#uses=3]
        %.omp_data_o.3 = alloca %struct..omp_data_s.1 ; <%struct..omp_data_s.1*> [#uses=2]
        store i32 0, i32* %sum, align 4
        %tmp = getelementptr %struct..omp_data_s.1* %.omp_data_o.3, i32 0, i32 0 ; <i32**> [#uses=1]
        store i32* %sum, i32** %tmp, align 8
        %.omp_data_o.31 = bitcast %struct..omp_data_s.1* %.omp_data_o.3 to i8* ; <i8*> [#uses=2]
        call void @GOMP_parallel_start( void (i8*)* @main.omp_fn.0, i8* %.omp_data_o.31, i32 0 ) nounwind
        call void @main.omp_fn.0( i8* %.omp_data_o.31 )
        call void @GOMP_parallel_end( ) nounwind
        %tmp3 = load i32* %sum, align 4 ; <i32> [#uses=1]
        %tmp5 = call i32 (i8*, ...)* @printf( i8* getelementptr ([10 x i8]* @.str, i32 0, i32 0), i32 %tmp3 ) nounwind
        ret i32 undef
}

define internal void @main.omp_fn.0(i8* %.omp_data_i) {
entry:
        %tmp1 = tail call i32 (...)* @omp_get_thread_num( ) ; <i32> [#uses=1]
        %tmp3 = bitcast i8* %.omp_data_i to i32** ; <i32**> [#uses=1]
        %tmp4 = load i32** %tmp3, align 4 ; <i32*> [#uses=1]
        %tmp45 = bitcast i32* %tmp4 to i8* ; <i8*> [#uses=1]
        %tmp6 = tail call i32 @__sync_fetch_and_add_4( i8* %tmp45, i32 %tmp1 ) nounwind ; <i32> [#uses=0]
        ret void
}

Anton Korobeynikov wrote:

> I've just committed a patch to llvm-gcc 4.2, which moves openmp lowering
> stuff to be run little bit earlier, so llvm-convert will catch its
> result. It looks now gcc atomic & sync builtins should be introduced to
> llvm as a remaining ingredient.
>
> Example program from Diego's paper now compiles to:

Thanks, Anton! The compiled code seems to be OK. IIRC, there is a
project aiming at introducing the gcc atomic and sync builtins. It is
temporarily suspended, but when it is eventually finished, OpenMP will
be a nice thing to test it with :)

Kind regards,
Wojtek

> As for the dependence analysis, I need this piece for my project, and I
> am preparing myself to write it.

Great!

> I am currently studying some papers, but haven't yet decided on the method. Maybe some of you have experience in the area and would like to advise me something? In the next few days I am going to take a look at the methods based on chains of recurrences algebra. They seem particularly interesting because chrecs are already implemented in the scalar evolution pass.

Yes, I'd strongly suggest implementing one of the approaches based on the chrec analysis we have in the scev pass.

-Chris

> I've just committed a patch to llvm-gcc 4.2, which moves openmp lowering
> stuff to be run little bit earlier, so llvm-convert will catch its
> result. It looks now gcc atomic & sync builtins should be introduced to
> llvm as a remaining ingredient.
>
> Example program from Diego's paper now compiles to:

Very nice Anton!!

-Chris


Wojciech Matyjewicz wrote:

> As for the dependence analysis, I need this piece for my project, and I
> am preparing myself to write it. I am currently studying some papers,
> but haven't yet decided on the method. Maybe some of you have experience
> in the area and would like to advise me something?

I recently read Allen & Kennedy's book Optimizing Compilers for Modern
Architectures - A Dependence-Based Approach. It is extremely useful if
you are interested in analysis and transformations for parallelization
(ILP, vectorization, multiprocessing).

<http://www.amazon.com/Optimizing-Compilers-Modern-Architectures-Dependence-based/dp/1558602860/ref=sr_1_9?ie=UTF8&s=books&qid=1196410806&sr=1-9>

One caveat though: look up the errata on the net right from the start,
as there are lots of typos at least in the printing I have.

Hi,

> > As for the dependence analysis, I need this piece for my project, and I
> > am preparing myself to write it.
>
> Great!
>
> > I am currently studying some papers, but haven't yet decided on the
> > method. Maybe some of you have experience in the area and would like to
> > advise me something? In the next few days I am going to take a look at
> > the methods based on chains of recurrences algebra. They seem
> > particularly interesting because chrecs are already implemented in the
> > scalar evolution pass.
>
> Yes, I'd strongly suggest implementing one of the approaches based on the
> chrec analysis we have in the scev pass.

Yup, just get the array accesses into scev form, and set up a constraint
system from these. Once you set up the constraint systems for the
dependence distance vectors, you're almost done. You can look at how
this is set up in gcc trunk's tree-data-ref.c, and search for omega.

I would recommend the Omega test: get the C++ code for the Omega library
from http://www.cs.umd.edu/projects/omega/

Sebastian

> > Yes, I'd strongly suggest implementing one of the approaches based on the
> > chrec analysis we have in the scev pass.
>
> Yup, just get the array accesses under scev form, and set a constraint
> system from these. Once you set up the constraint systems for the
> dependence distance vectors, you're almost done. You can look at how
> this is set up in gcc trunk tree-data-ref.c, and search for omega.
>
> I would recommend the omega test: get the c++ code for omega lib from
> http://www.cs.umd.edu/projects/omega/

Regardless of which solver you use, you should keep a clean interface between your constraint representation and the solver. I like Omega a lot, and it is a fine choice for a first solver. Although it is very powerful, even it cannot handle symbolic loop strides. Perhaps more importantly for common uses, for many trivial cases (A[x] and A[x+1]), setting up the Omega data structures to invoke the solver can be too heavyweight. It could be worthwhile to add one or more simple solvers (see the Allen and Kennedy textbook for examples) before falling back on Omega.

> Sebastian
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev@cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--Vikram
http://www.cs.uiuc.edu/~vadve
http://llvm.org

> > Yup, just get the array accesses under scev form, and set a constraint
> > system from these. Once you set up the constraint systems for the
> > dependence distance vectors, you're almost done. You can look at how
> > this is set up in gcc trunk tree-data-ref.c, and search for omega.
> >
> > I would recommend the omega test: get the c++ code for omega lib from
> > http://www.cs.umd.edu/projects/omega/
>
> Regardless of which solver you use, you should keep a clean interface
> between your constraint representation and the solver. I like Omega a
> lot and it is a fine choice for the first solver. Although it is very
> powerful, even it cannot handle symbolic loop strides. Perhaps more
> importantly for common uses, for many trivial cases (A[x] and A[x+1]),
> setting up the Omega data structures to invoke the solver can be too
> heavyweight. It could be worthwhile to add one or more simple solvers
> (see the Allen and Kennedy textbook for examples) before falling back
> on Omega.

Sebastian, Vikram, thank you for the advice.

My initial plan was to implement the methods from the Allen and Kennedy
book. I'll see if I find time to add the Omega solver as a fallback,
too. I am not very familiar with GCC internals, but having taken a look
at tree-data-ref.c, it seems this would be the same approach as the one
used there.

Wojtek

> Sebastian, Vikram, thank you for the advice.
>
> My initial plan was to implement the methods from the Allen and Kennedy
> book. I'll see if I find time to add the Omega solver as a fallback,
> too. I am not very familiar with GCC internals, but having taken a look
> at tree-data-ref.c, it seems it would be the same approach as there.

That makes sense to me.

If you can add a dependence analysis capability to LLVM, I think it will fill a really important gap, because I've seen at least two research projects decide not to use LLVM, reportedly for the lack of dependence analysis.

--Vikram
http://www.cs.uiuc.edu/~vadve
http://llvm.org

Sure, and this is exactly what tree-data-ref does (we use a number of
simple solvers, but leave the hard stuff to omega).