multi-threading in llvm

Hi,

I want to execute the iterations of a loop in parallel, by inserting calls either to pthreads or to the gomp library at the LLVM IR level. As a first step, I inserted an omp pragma in a C file and compiled it with llvm-gcc to check the generated LLVM code. If I understand correctly, to parallelize the loop in LLVM IR, I have to separate the loop into a new function, put all required parameters in a structure, make the call to the gomp library, and restore all values from the structure into the original variables.
I also have to compute the number of iterations allocated to each thread and insert a new condition into the loop body, such that each thread executes only its slice.
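Roughly, I picture the result like this (a hypothetical C-level sketch with made-up names, using pthreads):

  #include <pthread.h>

  /* The loop body is outlined into 'worker', the values it needs are
     packed into a structure, and each thread executes only its slice. */
  struct loop_args {
    double *a, *b;     /* captured pointers                */
    long begin, end;   /* this thread's slice [begin, end) */
  };

  static void *worker(void *p) {
    struct loop_args *args = p;
    for (long i = args->begin; i < args->end; ++i)
      args->a[i] = 2.0 * args->b[i];   /* original loop body */
    return NULL;
  }

  void run_parallel(double *a, double *b, long n, int nthreads) {
    pthread_t tid[nthreads];
    struct loop_args args[nthreads];
    long chunk = (n + nthreads - 1) / nthreads;

    for (int t = 0; t < nthreads; ++t) {
      long begin = t * chunk;
      long end   = begin + chunk > n ? n : begin + chunk;
      args[t] = (struct loop_args){ a, b, begin, end };
      pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < nthreads; ++t)
      pthread_join(tid[t], NULL);
  }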

Is that correct?

As far as I know, both llvm-gcc and Polly already offer support for OpenMP, by inserting calls to the gomp library. Can this code be reused? Is there a pass that I can call to do all these code transformations?

I had a look at the CodeGeneration from Polly. Is it possible to use it without creating the Scops, by transforming it into a LoopPass?
Could you indicate how this is handled in llvm-gcc?

Thank you for your help.

Alexandra.

Hi Alexandra,

I don't know much about this; maybe the topic should be bridged with polly-dev
(adding it to CC) to bring it more attention.

Indeed, Polly uses ScopPass, which creates serious limitations in
compatibility with other passes. To my understanding, scops are used
because the ISL loop analysis tool works on scops. In fact, just for handling
OpenMP directives, scops are not required, unless one needs to make sure the
OpenMP directive is set only for loops with parallel iterations.

Btw, it would be very interesting to know more about your
project/purpose for this!

- D.

Hi Alexandra,

I had a look at the CodeGeneration from Polly. Is it possible to use it
without creating the Scops, by transforming it into a LoopPass?

Yes. If you don't want to use the autopar of Polly and just rely on
the pragmas inserted by the programmer, you don't need SCoPs.
You can have the parallel code generation part of Polly working as
a LoopPass.

If you want to use the auto-parallelization part of Polly, you have to
restrict the form of the code on which data dependences can be
computed: this is done by using SCoPs (Static Control Parts).

Sebastian

Hi Alexandra,

I don't know much, maybe this topic should be bridged with polly-dev
(adding it to CC) to bring it more attention.

Thanks Dimitry for moving this over. I would also have replied on the LLVM list, but was away this weekend. Now I am back.

Indeed, Polly uses ScopPass, which creates serious limitations in
compatibility with other passes. To my understanding, scops are used
because the ISL loop analysis tool works on scops.

Scops are used, as this is the problem domain we are working on. Scops are not just loops, but can also be nested conditions without any loops.
Also, describing them as loop passes is not what we want, as the idea of our polyhedral description is to abstract away the specific structure of the loops.
I would be very interested to hear which use cases you believe are limited by the use of ScopPass. If you are e.g. only interested in Dependency Information or Alias Information, it might be possible to
create a LoopPass that proxies the relevant information, such that it is available to other passes.

> In fact, just for handling

OpenMP directives, scops are not required, unless one needs to make sure the
OpenMP directive is set only for loops with parallel iterations.

Right. Scops are not needed to handle OpenMP directives, and Polly is actually not meant to handle OpenMP directives. It is just one of several ways to introduce OpenMP parallelism: Polly does it if the loops it generates are parallel, clang would do it if the user added OpenMP directives, and Alexandra may introduce parallelism in similar cases.

For me handling 'OpenMP directives' could mean two things:
1) clang understands OpenMP pragmas and lowers them to a set of OpenMP intrinsics/function calls and structures.

2) An LLVM optimization pass understands a set of high level OpenMP intrinsics that it can optimize and transform to specific libgomp or mpc.sf.net library calls.

Neither is available in LLVM yet, but both would be highly useful. Especially 2) would be nice for Polly, and probably also for Alexandra.

Btw, it would be very interesting to know more about your
project/purpose for this!

Oh yes. I am also highly interested.

Hi,

I want to execute the iterations of a loop in parallel, by inserting calls
either to pthreads or to the gomp library at the LLVM IR level. As a first
step, I inserted an omp pragma in a C file and compiled it with llvm-gcc to
check the generated LLVM code.

This is a good step. When you do this, make sure you use
'schedule(runtime)'. Otherwise gcc will not only emit function calls to set up libgomp, but also inline instructions, which makes the code a lot harder to understand.
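For example, a minimal input like the following (the array names are made up) keeps the generated calls easy to follow:

  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];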

>> If I understand correctly, to parallelize the

loop in LLVM IR, I have to separate the loop in a new function, put all
required parameters in a structure, make the call to the gomp library, and
restore all values from the structure in the original variables.
Also, I have to compute the number of iterations allocated to each thread
and insert in the loop body a new condition, such that each thread executes
only its slice.
Is that correct?

Partially, and it also depends on what kind of OpenMP parallel loop you want to generate. I suggest you generate a schedule(runtime) OpenMP parallel loop, as this is the easiest one to produce. Here you basically need to do the following:

Host function:
(The function that contains the loop you want to parallelize)

Here you replace the loop with calls to:

GOMP_parallel_loop_runtime_start(subfunction, subfunction_data,
                                  number_of_threads, lower_bound,
                                  upper_bound, stride)
subfunction()
GOMP_parallel_end()

subfunction is the address of a newly created function (the outlined loop body); subfunction_data is the address of the structure that contains the data needed in the subfunction. The remaining arguments should be obvious.

subfunction is now basically:

long lower_bound, upper_bound;

while (GOMP_loop_runtime_next(&lower_bound, &upper_bound)) {
  for (long i = lower_bound; i < upper_bound; i += stride) {
    // Put your loop body here
  }
}
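Putting both pieces together, a compilable C sketch could look like this. (The loop body, the struct contents and the call to GOMP_loop_end_nowait() are my assumptions; the entry points are the libgomp ones named above, declared by hand here because the compiler normally emits these calls directly.)

  #include <stdbool.h>

  void GOMP_parallel_loop_runtime_start(void (*fn)(void *), void *data,
                                        unsigned num_threads,
                                        long start, long end, long incr);
  bool GOMP_loop_runtime_next(long *istart, long *iend);
  void GOMP_loop_end_nowait(void);
  void GOMP_parallel_end(void);

  struct loop_data { double *a, *b; };       /* data needed by the body */

  static void subfunction(void *p) {
    struct loop_data *d = p;
    long lower_bound, upper_bound;
    while (GOMP_loop_runtime_next(&lower_bound, &upper_bound)) {
      for (long i = lower_bound; i < upper_bound; ++i)
        d->a[i] = 2.0 * d->b[i];             /* original loop body */
    }
    GOMP_loop_end_nowait();                  /* each thread signals completion */
  }

  void host(double *a, double *b, long n) {
    struct loop_data data = { a, b };
    GOMP_parallel_loop_runtime_start(subfunction, &data,
                                     0 /* threads: runtime default */,
                                     0 /* start */, n /* end */, 1 /* incr */);
    subfunction(&data);                      /* the calling thread works too */
    GOMP_parallel_end();
  }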

As far as I know, both llvm-gcc and Polly already offer support for OpenMP,
by inserting calls to the gomp library.

Polly supports automatic parallelization of the loops it generates. gcc (and therefore llvm-gcc) supports both user-added OpenMP pragmas and automatically added OpenMP calls (provided by the -autopar pass).

> Can this code be reused?
That depends on what you plan to do. The code in Polly is currently specific to Polly: it only creates the OpenMP calls that we need and basically builds an OpenMP loop from scratch.

What you most probably want is a pass that takes an existing LLVM-IR loop and translates it into an OpenMP parallel loop.

From Polly you could reuse the functions that create the declarations of the relevant libgomp functions, that set up the subfunctions, and that create the new loop structure. To get a pass that translates an LLVM-IR loop into an LLVM-IR loop, you still need to implement the actual transformation.

Is there a pass that I can call to do all these code transformations?

No. LLVM has no included passes for OpenMP transformations. Polly just does code generation for its own, specific use case. gcc has some passes that lower high-level OpenMP calls to more individual OpenMP calls and calculations.

I had a look at the CodeGeneration from Polly. Is it possible to use it
without creating the Scops, by transforming it into a LoopPass?

No. It is not a pass that translates from LLVM-IR to LLVM-IR, but it creates new LLVM-IR from a polyhedral description. So without Scops and therefore without a polyhedral description it cannot be used.

Could you indicate how is this handled in llvm-gcc?

The only pass that generates OpenMP calls in llvm-gcc is the gcc -autopar pass. It basically detects if a loop is parallel and introduces calls to libgomp (I am not sure if it creates higher level intrinsics that are lowered by a subsequent gcc pass to the actual libgomp calls or equivalent instructions).

Alex, let me know what you are planning to do. Maybe we can work together on some OpenMP infrastructure in LLVM. I would love to have both a generic OpenMP builder that can be used by your transformations as well as by Polly, and a pass that lowers high-level OpenMP intrinsics to higher-performance OpenMP code and low-level function calls.

Cheers
Tobi

Hi Alexandra,

I had a look at the CodeGeneration from Polly. Is it possible to use it
without creating the Scops, by transforming it into a LoopPass?

Yes. If you don't want to use the autopar of Polly and just rely on
the pragmas inserted by the programmer, you don't need SCoPs.
You can have the parallel code generation part of Polly working as
a LoopPass.

Are you sure about this? In CodeGeneration we basically translate a CLooG AST into LLVM-IR. Without a CLooG AST this does not work.

If you want to use the auto-parallelization part of Polly, you have to
restrict the form of the code on which data dependences can be
computed: this is done by using SCoPs (Static Control Parts).

In Polly we just generate OpenMP calls if we need them. We do not provide a generic OpenMP infrastructure (something that would be nice to have in LLVM and to which I would love to contribute). However, such a generic infrastructure has a broader scope than the single specific OpenMP pattern we implemented in Polly.

Cheers
Tobi

I mean you could rewrite that code to work on a per-loop basis, like
Graphite does: flag the parallel loops in the code generation of Polly
and then, independently of Polly, as a LoopPass, iterate over all the
loops and code-generate the parallel ones.

Sebastian

Yes, that's true and probably a good idea. It is more work than generating it directly in Polly, but it facilitates reuse of code. We can probably reuse quite a bit of the existing OpenMP code generation.

We (or actually someone) may work on three things:

1) The high-level OpenMP intrinsics that can be lowered to actual library calls

This can be reused by a clang OpenMP extension as well as by Polly directly.

2) An OpenMP builder that allows building OpenMP loops from scratch

3) The actual parallelizer, which takes an LLVM-IR loop and parallelizes it with OpenMP intrinsics.

Cheers
Tobi

Recently I met the following spec: OMP_NUM_THREADS (GNU libgomp).

It was new to me that OpenMP actually allows specifying "the number of
threads to use for the corresponding nested level" for
multidimensional loops. Is it correct that in Polly this feature is not
supported, and the OpenMP clause is always applied only to the outermost
loop?

With this supported, I think we can get much closer to a natural
unification of OpenMP and the multidimensional compute grids of GPU
programming models.

Thanks for your ideas.

Polly looked interesting, as it seems it has already introduced the basic support for OpenMP that I need. But indeed, it seems difficult to apply the same code to a regular LLVM loop instead of a SCoP.

What I am working on is speculative parallelism, so I cannot provide the SCoPs required by Polly. I analyze the loops and try to parallelize them speculatively at the LLVM IR level. So I am searching for the easiest (and possibly fastest) way to create parallel code. I was not planning to reimplement the OpenMP support you provide, as I was hoping to reuse it. Maybe we can further discuss which solution is better and how difficult it is to adapt the parallel code generation from Polly for regular loops.

What I am looking for is a pass that can automatically build the structure of parameters required for thread creation and execution, and modify the loop code accordingly. Namely, it would load the parameters needed by the thread into a structure and replace the uses of these values inside the loop with the corresponding fields of the structure. Similarly, once the threads finish execution, it would update the variables with the correct values from the structure. Also, the loop code has to be modified to execute only the slice allocated to the thread. I have not yet gone into further details on private and shared variables between the threads and other OpenMP features.

As far as I know this code is not yet available in LLVM. Do you think the parallel code generation of Polly can be adapted to perform this? My work is not focused on adding OpenMP support to LLVM, but I was looking for code that already enables parallelism at the LLVM IR level. Of course, if necessary I will work on building the code and inserting calls to the gomp library to parallelize regular loops. In that case, we can discuss the best approach to stay general and to meet all our requirements.

Alexandra

Recently I met the following spec:
OMP_NUM_THREADS (GNU libgomp)
It was new to me OpenMP actually allows to specify "the number of
threads to use for the corresponding nested level" in the
multidimensional loops. Is that correct in Polly this feature is not
supported, and the OpenMP clause is always applied only to the outermost
loop?

Jep, it is not used by Polly.

With it supported I think we can step much closer to natural
unification of OpenMP and multidimensional compute grids of GPU
programming models.

I would be very interested to see a use case where this improves performance. If such use cases appear frequently, it would be nice to see work in this direction. I would be glad to contribute to the relevant discussions.

Cheers
Tobi

Thanks for your ideas.

Polly looked interesting as it seems it has already introduced the basic
support for OpenMP that I need. But indeed, it seems difficult to apply
the same code on a regular LLVM loop, instead of a SCoP.

What I am working on is speculative parallelism, so I cannot provide the
SCoPs required by Polly. I analyze the loops and try to parallelize them
speculatively at the LLVM IR level. So I am searching for the easiest
(and possibly fastest) way to create parallel code. I was not planning
to reimplement the OpenMP support you provide, as I was hoping to reuse
it. Maybe we can further discuss which solution is better and how
difficult it is to adapt the parallel code generation from Polly for
regular loops.

Sure, let's do this. We can also have a chat this Friday at Euro-LLVM.

What I am looking for is a pass that can automatically build the
structure of parameters required for the thread creation and execution,
and modify the loop code accordingly. Namely, to load the parameters
necessary for the thread into a structure and replace the use of these
values inside the loop with the corresponding field in the structure.
Similarly, once the threads finish the execution to update the values of
the variables with the correct values from the structure. Also, the loop
code has to be modified to execute only the slice allocated to the
thread. I did not go yet into further details on the private and shared
variables between the threads and other OpenMP features.

I would also like to have such a pass. It is more complicated than the code in Polly right now, but most probably worth the effort.

As far as I know this code is not yet available in LLVM. Do you think
the parallel code generation of Polly can be adapted to perform this?

Yes and no. I think we should be able to extract the generic functions that build the new OpenMP loop structure, the parameter structures, and the function calls. With those we could create an OpenMPBuilder, just as we have an LLVM-IR builder today.

Based on this, building a pass that translates a regular LLVM-IR loop into an OpenMP-parallelized one should not be too difficult. (Especially as you perform speculative parallelization and do not need any dependence analysis.)

> My

work is not focused on adding OpenMP support in LLVM, but I was looking
for codes that already enable parallelism in the LLVM IR. Surely, if
necessary I will work on building the code and inserting calls to the
gomp library, to parallelize regular loops. In that case, we can discuss
the best approach to stay general and to meet all our requirements.

Yes, I assume there is no finished solution available, though I am interested in work to get one.

Cheers
Tobi

Hi,

It was proposed to implement an OpenMP framework.

Is there any progress in this area? Was the question raised at
Euro-LLVM? Is anybody working on an implementation?

I am not aware of any work that was done or is planned to be done in public.

At Euro-LLVM I had a short discussion with Alexandra about targeting the OpenMP runtime. We would both be interested in code that can translate normal LLVM-IR loops into OpenMP parallel loops. I am not sure
if Alexandra has some code that she can share. I myself looked into
the problem; I believe the loop extractor is a good starting point for implementing such functionality. Yet, I do not have any written code ready.

Do you happen to be interested in working on this? :wink: I can definitely give you some hints.

Cheers
Tobi

I am interested. I would be grateful for your hints.

So OpenMP has various constructs such as parallel, barrier, single,
for, etc., and there are at least two libraries for generating OpenMP code:
libgomp and mpc. We want to be independent of a specific library.

We should create an interface with methods that mirror the manual
insertion of OpenMP pragmas in C or Fortran code. These would sound like
"this code block should be run with #parallel" or "this for loop
should be run with #parallel for". Constructs like barrier and
atomic can simply be represented by methods 'barrier' and 'atomic'. But
the parallel and for constructs allow many modifiers (schedule, private,
shared, etc.), and I cannot devise a good solution for these cases.

I guess this interface is close to IRBuilder; you suggested it would be an
OpenMPBuilder. This means IR will be created from scratch. The other way is
to provide methods that modify existing IR, as in the user saying "I want this
loop to be parallel, do it for me". What do you think about this?

The implementation consists of calls to OpenMP library functions, in the way defined
by the library ABI. For example, in mpc:

   pragma omp for schedule(runtime)
   for ( i=lb ; i COND b ; i+=incr ) { A ; }
   ->
   if ( __mpcomp_runtime_loop_begin( ... ) ) {
     do {
       for ( ... ) {
         A ;
       }
     } while ( __mpcomp_runtime_loop_next( ... ) ) ;
   }
   __mpcomp_runtime_loop_end() ;

I think this work is not simple for me, so any comments are welcome.

Hi,

I plan to work on parallelizing loops in LLVM in the near future. For the moment, the plan is to transform sequential loops into parallel loops at the IR level, by modifying their structure and inserting gomp calls. I do not plan to work on all the constructs available in OpenMP; for now, we focus only on loops. However, if our work turns out to be similar, we can keep in touch and exchange ideas. I'll be back with more details when I start the implementation.

Alexandra

I am interested. I would be grateful for your hints.

Great. :wink:

So OpenMP has various constructs such as parallel, barrier, single,
for, etc. And there is at least two libraries to generate OpenMP code:
libgomp and mpc. We want to be independent of specific library.

True.

We should create an interface with methods which present manual
inserting of OpenMP pragmas in C or Fortran code. These sounds like
"this code block should be run with #parallel" or "this for loop
should be run with #parallel for loop". Constructs like barrier,
atomic can be simply presented by methods 'barrier' and 'atomic'. But
parallel or for constructs allows many modifiers (schedule, private,
shared, etc). I cannot devise a good solution for these cases.

I think we should distinguish here between how to represent the OpenMP constructs in LLVM-IR and how to create such LLVM-IR.

Guess this interface is close to IRBuilder. You suggested it will be
OpenMPBuilder. This mean IR will be created from scratch. Other way is
to provide methods to modify existing IR, like user says "I want this
loop to be parallel, do it for me". What you think about this?

I think both are valid methods to create LLVM-IR that uses OpenMP extensions.

Alexandra and other people interested in automatic parallelization are probably more interested in transforming a plain LLVM-IR loop into LLVM-IR with OpenMP extensions. To add OpenMP annotation support to clang, the clang code generation would probably be better off directly generating LLVM-IR with OpenMP constructs.

Implementation is calls of OpenMP library functions in a way provided
by the library abi. For example, in mpc:

    pragma omp for schedule(runtime)
    for ( i=lb ; i COND b ; i+=incr ) { A ; }
    ->
    if ( __mpcomp_runtime_loop_begin( ... ) ) {
      do {
        for ( ... ) {
          A ;
        }
      } while ( __mpcomp_runtime_loop_next( ... ) ) ;
    }
    __mpcomp_runtime_loop_end() ;

I think this work is not simple for me, so any comments are welcomed.

I have some ideas, but as this is a larger project, we should probably get a wider audience to review such a proposal. (You may also want to copy the mpc authors, who have some experience with adding support for OpenMP to a compiler IR.)

Some ideas from my side:

How to represent OpenMP in LLVM?

I do not believe OpenMP needs direct LLVM-IR extensions; it should rather be represented through a set of intrinsics. The intrinsics may be similar to the function calls generated for OpenMP schedule(runtime),
but they should be runtime-library independent. They should also provide enough information to optimize them within LLVM, in case the necessary information is already available at compile time.

How to translate LLVM OpenMP intrinsics to actual code

Library-specific passes can be used to lower the intrinsics to the specific libgomp or mpc library calls. Where already possible at compile time, fast LLVM-IR sequences can be inlined instead of calling the actual library functions.

  pragma omp for schedule(runtime)
  for ( i=lb ; i COND b ; i+=incr ) { A ; }

-> Represented as LLVM-IR

  def original_function() {
    llvm.openmp.loop(lower_bound, upper_bound, body, context, schedule)
  }

  def body(i, context) {
    A(i);
  }

-> Lowering to MPC: (for schedule = runtime)

  if ( __mpcomp_runtime_loop_begin( ... ) ) {
    do {
      for ( ... ) {
        A ;
      }
    } while ( __mpcomp_runtime_loop_next( ... ) ) ;
  }
  __mpcomp_runtime_loop_end() ;

-> Lowering to MPC: (for schedule = static)

Some fast LLVM-IR implementations of __mpc_static_loop_begin() may be called and inlined.
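For illustration, the per-thread chunk computation that such a fast inlined path boils down to might look roughly like this in C (all names are hypothetical, positive stride assumed):

  long n     = (upper_bound - lower_bound + stride - 1) / stride; /* trip count        */
  long chunk = (n + nthreads - 1) / nthreads;                     /* iterations/thread */
  long start = lower_bound + tid * chunk * stride;
  long end   = start + chunk * stride;
  if (end > upper_bound)
    end = upper_bound;
  for (long i = start; i < end; i += stride)
    body(i);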

How to generate the OpenMP intrinsics?

There are two ways:

* People implement the OpenMP frontend stuff in clang and clang can directly emit the relevant intrinsics when generating LLVM-IR.

* An automatic parallelizer generates code by transforming existing LLVM-IR. Here projects like Polly or Alexandra's tools come in.

Vlad:

If you are interested in working on such a thing, I propose you start by writing up a small proposal. This may contain the general architecture of all this, a proposal for how to represent OpenMP at the LLVM-IR level, and
everything else you think is important. You can then put this proposal up for discussion on the LLVM mailing list. I, and hopefully several other people, will be glad to give our opinions.

Cheers
Tobi

> I am interested. I would be grateful for your hints.

Great. :wink:

> So OpenMP has various constructs such as parallel, barrier, single,
> for, etc. And there is at least two libraries to generate OpenMP code:
> libgomp and mpc. We want to be independent of specific library.

True.

> We should create an interface with methods which present manual
> inserting of OpenMP pragmas in C or Fortran code. These sounds like
> "this code block should be run with #parallel" or "this for loop
> should be run with #parallel for loop". Constructs like barrier,
> atomic can be simply presented by methods 'barrier' and 'atomic'. But
> parallel or for constructs allows many modifiers (schedule, private,
> shared, etc). I cannot devise a good solution for these cases.

I think we should distinguish here how to represent the OpenMP
constructs in LLVM-IR and how to create such LLVM-IR.

> Guess this interface is close to IRBuilder. You suggested it will be
> OpenMPBuilder. This mean IR will be created from scratch. Other way is
> to provide methods to modify existing IR, like user says "I want this
> loop to be parallel, do it for me". What you think about this?

I think both are valid methods to create LLVM-IR that uses OpenMP
extensions.

Alexandra and other people interested in automatic parallelization are
probably more interested in transforming a plain LLVM-IR loop into
LLVM-IR with OpenMP extensions. To add OpenMP annotation support to
clang, the clang code generation would probably be better off by
directly generating LLVM-IR with OpenMP constructs.

> Implementation is calls of OpenMP library functions in a way provided
> by the library abi. For example, in mpc:
>
> pragma omp for schedule(runtime)
> for ( i=lb ; i COND b ; i+=incr ) { A ; }
> ->
> if ( __mpcomp_runtime_loop_begin( ... ) ) {
> do {
> for ( ... ) {
> A ;
> }
> } while ( __mpcomp_runtime_loop_next( ... ) ) ;
> }
> __mpcomp_runtime_loop_end() ;
>
>
> I think this work is not simple for me, so any comments are welcomed.

I have some ideas, but as this is a larger project, we should probably
get a wider audience to review such a proposal. (You may also want to
copy the mpc authors, who have some experience in how to add support for
OpenMP into a compiler IR.)

FWIW, there is a BSD-licensed OpenMP 3 source-to-source translator
available as part of the ROSE project: http://www.rosecompiler.org/
It might be worth looking at.

Some ideas from my side:

How to represent OpenMP in LLVM?

I do not believe OpenMP needs direct LLVM-IR extensions, but should
rather be represented through a set of intrinsics. The intrinsics may be
similar to function calls generated with OpenMP SCHEDULE (runtime),
but should be runtime library independent. They should also provide
enough information to optimize them within LLVM, in case the necessary
information is already available at compile time.

How to translate LLVM OpenMP intrinsics to actual code

Library specific passes can be used to lower the intrinsics to the
specific libgomp or mpc library calls. If already possible at compile
time, fast LLVM-IR sequences can be inlined instead of calling actual
library functions.

I agree, the ability to have target-specific lowering of the
parallelization constructs will be very important for some platforms. We
may need some new attributes in the IR so that we can still safely do
things like LICM.

-Hal

Hi all,

I'd like to put some effort into this too -- perhaps I can write a
backend for libgomp while someone else works on a libmpc one.

As far as the architecture is concerned, I concur with what has
already been discussed: mapping OpenMP constructs to LLVM intrinsics.
I think it would make sense to leave out intrinsics for things like
"parallel for", "parallel sections", which are essentially syntactic
sugar.

One way to denote a structured block could be by nesting it within two
intrinsics like

llvm.openmp.parallel_begin (llvm.openmp.clause_if(%1) <and so forth>)
  body
llvm.openmp.parallel_end()

or we could pass in an end label to parallel_begin (we can then call
it parallel instead of parallel_begin). I'm not sure which one is the
better idea. There will have to be restrictions on the body, of
course. We can't allow control to jump outside body without
encountering the barrier, for instance.
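For instance (made-up C code), the kind of control flow that would have to be rejected looks like this:

  #pragma omp parallel
  {
    if (error)
      goto bail;   /* not allowed: leaves the structured block early */
    do_work();
  }
  bail:
    cleanup();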

The more difficult problem would be, I think, to lift the structured
block and create a function out of it that takes a closure.

I also think there is potential for optimizing things like the
reduction clause and the atomic construct by lowering them directly
into CAS or other machine instructions. But I'll stop here before I
get ahead of myself. :slight_smile:
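To sketch the idea for the atomic case (using C11 atomics purely for illustration; the names are hypothetical), "#pragma omp atomic  x += v;" could become a compare-and-swap loop rather than a runtime-library call:

  #include <stdatomic.h>

  static void atomic_add(_Atomic long *x, long v) {
    long old = atomic_load(x);
    while (!atomic_compare_exchange_weak(x, &old, old + v)) {
      /* 'old' has been refreshed with the observed value; retry */
    }
  }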

Thanks!

Hi all,

I'd like to put some effort into this too -- perhaps I can write a
backend for libgomp while someone else works on a libmpc one.

As far as the architecture is concerned, I concur with what has
already been discussed: mapping OpenMP constructs to LLVM intrinsics.
I think it would make sense to leave out intrinsics for things like
"parallel for", "parallel sections", which are essentially syntactic
sugar.

I agree.

One way to denote a structured block could be by nesting it within two
intrinsics like

I am not sure how well this will work in practice, but we should come
up with a plan. As far as I can tell, we have two options: intrinsics
and metadata; we may want to use some combination of the two. One key
issue is to decide how optimization passes will interact with the
parallelization constructs. I think that it is particularly important
that using OpenMP does not preclude loop unrolling and LICM, for
example.

As you mention, we'll also want to provide the ability to do
target-specific lowering of certain constructs (like atomics). We also
want to make sure that we can recognize and optimize expressions that
are constant for a given thread within a parallel region (like a
get-thread-id operation).

Thoughts?

-Hal