Speculative Loop Parallelization on LLVM IR

Hi Javed,

Hi:
I worked on loop-optimization techniques previously using ORC.
Currently I see lots of research on speculative parallelization of
loops ... especially because multicores [for embedded systems] are
becoming popular. In other words, because you have multiple cores, you
can start some loops [Fast-Track] as if there is no or low
data-dependence [Partial Parallel Loop-Nest].
The normal part is to check later if there was some violation and then
roll back, etc. [something like that ... I don't think people want to
hear the full story here].
What do you guys think about the feasibility of "speculative run-time
loop optimization" using LLVM IR as a base?
I know there are lots of issues with it:
1. Loop transformations require tools like PolyLib (polyhedral
analysis). These are too computationally expensive for run-time
optimization.

There is Polly (still in development)
http://wiki.llvm.org/Polyhedral_optimization_framework

And one note on this: the current problem is usually not the computational complexity. However, it might be that we have not yet reached the point where this becomes relevant.

2. I am not clear on how to launch threads at the LLVM IR level (you do
need to create new threads for some of these run-time speculative
loop-parallelization techniques).

You could add calls to the POSIX functions that create a thread to the LLVM IR, or you could use some higher-level libraries that support thread creation.

The OpenMP runtime libraries, for example, can help you with this. Here is a link to the GNU libraries that could be used to try this approach:
http://gcc.gnu.org/onlinedocs/libgomp/
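
To make the POSIX route concrete, here is a rough sketch of emitting a pthread_create call with LLVM's C++ API. This is only an illustration against a reasonably recent LLVM; emitThreadCreate and the choice of i64 for pthread_t are my own assumptions, not an existing interface:

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Emit a call to pthread_create(&tid, /*attr=*/null, Body, Arg) at the
// builder's insertion point. "Body" is a function with the signature
// void *body(void *) that contains the outlined loop chunk.
static Value *emitThreadCreate(Module &M, IRBuilder<> &B,
                               Function *Body, Value *Arg) {
  LLVMContext &Ctx = M.getContext();
  PointerType *VoidPtrTy = PointerType::getUnqual(Type::getInt8Ty(Ctx));
  // pthread_t is modeled as i64 here; the real width is target-dependent.
  Type *PthreadTy = Type::getInt64Ty(Ctx);
  FunctionType *StartTy =
      FunctionType::get(VoidPtrTy, {VoidPtrTy}, /*isVarArg=*/false);
  FunctionType *CreateTy = FunctionType::get(
      Type::getInt32Ty(Ctx),
      {PointerType::getUnqual(PthreadTy), VoidPtrTy,
       PointerType::getUnqual(StartTy), VoidPtrTy},
      /*isVarArg=*/false);
  FunctionCallee Create = M.getOrInsertFunction("pthread_create", CreateTy);

  Value *Tid = B.CreateAlloca(PthreadTy, nullptr, "tid");
  Value *NoAttr = ConstantPointerNull::get(VoidPtrTy);
  Value *Start = B.CreateBitCast(Body, PointerType::getUnqual(StartTy));
  return B.CreateCall(Create, {Tid, NoAttr, Start, Arg});
}

You would outline the loop body into Body beforehand and emit a matching pthread_join call the same way.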

Is somebody already doing something like this? Does this proposal even
make sense as a starting point...

I do not yet understand what you want to do.

You just want to execute some loops in parallel without knowing if they have data dependencies, and then afterwards you would like to check whether there were dependencies.

How would you check for violated dependencies?

Can you give an example of a loop that would have no dependencies and that you would like to improve?

Tobi

Hi Tobias:

Thanks for replying. So if I understand correctly, the polyhedral model is currently being built in LLVM ( LLVM IR -> Poly Model -> LLVM IR ).
This is for compile-time optimization of loop-nests [e.g. loop transformations to expose parallelism or improve locality, etc.]. Yes, that's great for optimizing loop-nests.

In addition, since the real value of LLVM to me is its run-time optimization possibilities, some loop-nest optimizations "need" to be done at run-time.
I say "need" because many loops have data-dependencies that cannot be resolved at compile-time
[for example

FOR i = 1 to N
    A[ B[i] ] = A[ i ] + func(...)
END

Can the iterations of this loop be run in parallel?
]

There are two approaches for dealing with this kind of problem. One is inspector/executor, the other is speculation.
To keep the story short: basically, you run the loop iterations in parallel and verify at the end whether data dependencies were violated. If yes, you rewind and run the loop sequentially.
If no data dependencies were violated for that particular run, you have gained execution time [yes, there is a cost involved in verifying, and the net gain is not always positive].
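
To make the mechanics concrete, here is a minimal sequential simulation of that speculate/verify/rollback scheme for the loop above. This is purely illustrative (run_speculatively and the f stub are made-up names); a real implementation would run chunks of iterations on separate threads and update the shadow state atomically:

#include <cstddef>
#include <vector>

static int f(int i) { return i; }  // stand-in for func(...)

// Assumes every B[i] is a valid index into A and B.size() <= A.size().
bool run_speculatively(std::vector<int> &A, const std::vector<int> &B) {
  std::vector<int> backup = A;            // snapshot, kept for rollback
  std::vector<int> writer(A.size(), -1);  // shadow: which iteration wrote a slot
  bool conflict = false;

  // "Parallel" phase (simulated sequentially here): every iteration reads
  // the pre-loop value backup[i], which is what a fully parallel run with
  // no value forwarding would observe.
  for (std::size_t i = 0; i < B.size(); ++i) {
    if (writer[B[i]] != -1)
      conflict = true;                    // write-write: parallel order matters
    A[B[i]] = backup[i] + f((int)i);
    writer[B[i]] = (int)i;
  }

  // Verify phase: under sequential semantics, iteration i reads A[i]. If a
  // different iteration wrote slot i, our run read a stale value. (This is
  // conservative: it also flags writes by later iterations.)
  for (std::size_t i = 0; i < B.size(); ++i)
    if (writer[i] != -1 && writer[i] != (int)i)
      conflict = true;

  if (conflict) {                         // rollback and re-run sequentially
    A = backup;
    for (std::size_t k = 0; k < B.size(); ++k)
      A[B[k]] = A[k] + f((int)k);
    return false;                         // speculation lost
  }
  return true;                            // speculation paid off
}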

OK, so what's my point? To be able to do at least some loop transformations at run-time to expose parallelism etc., perhaps some kind of LLVM IR -> Poly -> LLVM IR
support at run-time may be required. Definitely a scaled-down version, since polyhedral transformations need a lot of processing in my opinion.
So if I understand correctly, we are not there yet ... and maybe I can come back with some proposals/ideas and cross-check them with you guys.

Thanks
Javed

Yes, you are right. I also believe it is a very interesting area of research to check if/how much optimization can be done at runtime.

This is one of the reasons I built Polly, as I am planning to use LLVM's JIT facilities to explore the possible optimizations at runtime.

Please let me know about any ideas you have in this direction, as I am highly interested in these topics.

Cheers
Tobi

Hi Javed,

This is quite interesting, but there are many dangers that can easily be
forgotten. First and most obvious, you'll have to take all computation
into account when judging whether there was a net gain, not only the
vectorized execution. It might not be trivial to calculate the gain if
you don't have control of the cycles or if the pipeline is too deep. In
big loops it makes little difference, but there will be a threshold
(depending on machine configuration etc.) below which there is a big
hit on performance, especially at program start-up, which is the most
critical time for the user.

Another phantom is that, when dealing with pointers (not arrays, as
you suggest), the locality is relative and depends on *each* execution.
You might tag the loop as "safe" if at first it seems so, but you
still have to check *every* time for disparities so as not to get caught
in that trap. That also adds to the "extra cost" this mechanism brings.

Last, as you said, the polyhedral machinery is very expensive. Stripped-down
versions might not be suitable for analysing all cases, so you might end
up with a handful of sub-versions, and that adds time and space complexity
to the program. This might also be running in a JITed environment, which
also adds to the invisible cost, with unpredictable cache hits, memory
usage, etc.

My tuppence.

cheers,
--renato

In cases where it matters, most compilers will perform function
versioning and insert data-dependence checks to decide at runtime
which version (parallel or not) can be run.

Then the JIT can simply eliminate the branches it proves are live/dead,
and you are left with either the parallelized version or the original.
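
For concreteness, that versioning pattern looks roughly like this (a sketch with made-up function names):

#include <cstddef>

void serial_version(double *a, const double *b, std::size_t n);
void parallel_version(double *a, const double *b, std::size_t n);

// Versioned entry point: the guard tests the one dependence that could
// not be resolved statically (here, whether the arrays overlap). A JIT
// that observes the actual pointer values can prove one side dead and
// drop the branch entirely.
void loop_entry(double *a, const double *b, std::size_t n) {
  bool disjoint = (a + n <= b) || (b + n <= a);
  if (disjoint)
    parallel_version(a, b, n);
  else
    serial_version(a, b, n);
}
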
As you say, this is the standard approach; why would you need to run
Poly at runtime?


Maybe he is thinking about the cases where dependencies cannot be
fully resolved until the loop is run?

Do you really need to do transformations at run time?

http://www.springerlink.com/content/ly5t6f8fyyr28m8y/

:-)


What percent of loops is this, and how often do they occur?
Also, most of these models can insert all the guards, even for things
that cannot be resolved until the loop is run; they just end up
generating a lot more guards, no?
It's like runtime alias checks.
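
To sketch what those guards might look like for the A[B[i]] loop from earlier in the thread (illustrative only; a compiler would emit the equivalent checks inline in the versioned code):

#include <cstddef>
#include <vector>

// Returns true when it is safe to run "A[B[i]] = A[i] + f(i)" with all
// iterations in parallel: O(N) of guard work up front, instead of
// speculate-and-rollback. Assumes B.size() <= a_size.
bool guards_hold(const std::vector<int> &B, std::size_t a_size) {
  std::vector<char> written(a_size, 0);
  for (std::size_t i = 0; i < B.size(); ++i) {
    std::size_t w = (std::size_t)B[i];
    if (w >= a_size) return false;  // out-of-range write: give up
    if (written[w]) return false;   // write-write conflict between iterations
    written[w] = 1;
  }
  // Iteration i reads A[i]; a write to slot i by a *different* iteration is
  // a cross-iteration dependence (writers are unique per the check above,
  // so written[i] with B[i] != i means the writer was another iteration).
  for (std::size_t i = 0; i < B.size(); ++i)
    if (written[i] && (std::size_t)B[i] != i)
      return false;
  return true;
}

If guards_hold comes back false you take the serial version, which is the same versioning pattern sketched above.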