[RFC] Parallelization metadata and intrinsics in LLVM

Hi Hal

I was also looking at providing such support in LLVM for capturing (both explicit and implicit)
parallelism. We had an initial discussion around this, and your proposal comes at the right
time. We support such an initiative, and we can work together to get this support implemented
in LLVM.

But I have a slightly different view. Parallelism today does not necessarily mean OpenMP
or SIMD; we are in the era of heterogeneous computing. I agree that your primary target
was thread-based parallelism, but I think we could extend this while we capture the parallelism
in the program.

My idea is to capture parallelism in the way you have described, using 'metadata'. I agree with
recording the parallel regions in the metadata (as given by the user). However, we could also
provide placeholders to record any additional information that the compiler writer needs, such
as the number of threads, scheduling parameters, chunk size, etc., some of which are perhaps
specific to OpenMP.

The point is that the same parallel loop could be targeted at accelerators (like GPUs) today
using another standard, OpenACC. We may get a new standard for capturing and targeting a
different kind of parallel device, which could look quite different and would have to be
specifically targeted.

Since we are at the intermediate layer, we could be independent of user-level standards like
OpenMP, OpenACC, OpenCL, Cilk+, C++AMP, etc., and at the same time keep enough information at
this stage so that the compiler could generate efficient backend code for the target device.
So, my suggestion is to keep all this relevant information as 'tags' in metadata, and it is up
to the backend to use or discard the information. As you said, if the backend ignores it,
there should be no harm to the correctness of the final code.

The second point I wanted to make was on the intrinsics. I am not sure why we need these
intrinsics at the LLVM level, or why we would need conditional constructs for expressing
parallelism. These could be calls made directly to the runtime library at the code-generation
level.

Again, this is a very good initiative, and we would like to see such support in LLVM ASAP.

Prakash Raghavendra
AMD, Bangalore
Email: Prakash.raghavendra@amd.com
Phone: +91-80-3323 0753

Hi Prakash,

I can't see the silver bullet you do. Different types of parallelism
(thread/process/network/heterogeneous) have completely different
assumptions, and the same keyword can mean different things, depending
on the context. If you try to create a magic metadata that will cover
from OpenCL to OpenMP to MPI, you'll end up having to have namespaces
in metadata, which is the same as having N different types of
metadata.

If there were a language that could encompass the pure meaning of
parallelism (Oracle just failed at building one), one could make
assumptions for each paradigm (OpenCL, OpenMP, etc.) much more easily
than by trying to fit the already complex rules of C/C++ into the even
more complex rules of target/vendor-dependent behaviour. OpenCL is
supposed to be less of a problem in that regard, but the target is so
different that I wouldn't try to merge OpenCL keywords with OpenMP ones.

True, you can do that with a handful of basic concepts, but the more
obscure ones will break your leg. And you *will* have to implement
them. Those of us unlucky enough to have had to implement bitfields,
anonymous unions, volatile, and C++ class layout know what I mean by
that.

True, we're talking about the language-agnostic LLVM IR, but you have
to remember that LLVM IR is built from real-world languages and is
thus full of front-end hacks and fiddles to tell the back end about
the ABI decisions in a generic way.

I'm still not convinced there will be many shared keywords between
all parallel paradigms, i.e., that you can take the same IR and
compile it to OpenCL, or OpenMP, or MPI, etc., and it'll just work
(and optimise).

Hi Hal

I was also looking at providing such support in LLVM for capturing
(both explicit and implicit) parallelism. We had an initial
discussion around this, and your proposal comes at the right time. We
support such an initiative, and we can work together to get this
support implemented in LLVM.

Great!

But I have a slightly different view. Parallelism today does not
necessarily mean OpenMP or SIMD; we are in the era of heterogeneous
computing. I agree that your primary target was thread-based
parallelism, but I think we could extend this while we capture the
parallelism in the program.

I don't think that we have a different view, but my experience with
heterogeneous systems is limited, and while I've played around with
OpenACC and OpenCL some, I don't feel qualified to design an LLVM
support API for those standards. I don't feel that I really understand
the use cases well enough. My hope is that others will chime in with
ideas on how to best support those models.

I think that the largest difference between shared-memory parallelism
(as in OpenMP) and the parallelism targeted by OpenACC, etc. is the
memory model. With OpenACC, IIRC, there is an assumption that the
accelerator memory is separate and specific data-copying directives are
necessary. Furthermore, with asynchronous-completion support, these
data copies are not optional. We could certainly add data-copying
intrinsics for this, but the underlying problem is code assumptions
about the data copies. I'm not sure how to deal with this.
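
To make the concern concrete, here is a purely hypothetical sketch of what data-copying intrinsics might look like; the intrinsic names and signatures below are invented for illustration and do not exist in LLVM:

```llvm
; Hypothetical intrinsics -- invented for illustration only.
; Copy a host buffer to device memory before the offloaded region and
; copy it back afterwards. The open question is what the optimizers may
; assume about memory state on either side of these calls, especially
; with asynchronous completion.
declare i8* @llvm.acc.copyin(i8*, i64)
declare void @llvm.acc.copyout(i8*, i8*, i64)

define void @offload(i8* %buf, i64 %size) {
entry:
  %dev = call i8* @llvm.acc.copyin(i8* %buf, i64 %size)
  ; ... offloaded region operates on %dev ...
  call void @llvm.acc.copyout(i8* %buf, i8* %dev, i64 %size)
  ret void
}
```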

My idea is to capture parallelism in the way you have described,
using 'metadata'. I agree with recording the parallel regions in the
metadata (as given by the user). However, we could also provide
placeholders to record any additional information that the compiler
writer needs, such as the number of threads, scheduling parameters,
chunk size, etc., some of which are perhaps specific to OpenMP.

I agree, although I think that some of those parameters are generic
enough to apply to different parallelization mechanisms. They might
also be ignored by mechanisms for which they're irrelevant. We should
make the metadata modular; I think that is a good idea. Instead of
taking a fixed list of things, for example, we may want to encode
name/value pairs.
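
As a purely hypothetical sketch of such modular name/value pairs (none of these metadata names are real; everything here is illustrative), a parallel loop might carry something like:

```llvm
; Hypothetical metadata only -- no such named metadata exists in LLVM.
; A backend simply ignores any pair it does not understand.
br label %loop.header, !parallel !0

!0 = !{!"parallel.loop", !1, !2}        ; region tag plus optional pairs
!1 = !{!"num_threads", i32 8}           ; OpenMP-specific hint
!2 = !{!"schedule", !"dynamic", i32 4}  ; scheduling policy + chunk size
```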

The point is that the same parallel loop could be targeted at
accelerators (like GPUs) today using another standard, OpenACC. We
may get a new standard for capturing and targeting a different kind
of parallel device, which could look quite different and would have
to be specifically targeted.

Yes. We just need to make sure that we fully capture the semantics of
the standards that we're targeting. My idea was to start with OpenMP,
and make sure that we could fully capture its semantics, and then move
on from there.

Since we are at the intermediate layer, we could be independent of
user-level standards like OpenMP, OpenACC, OpenCL, Cilk+, C++AMP,
etc., and at the same time keep enough information at this stage so
that the compiler could generate efficient backend code for the
target device.

Yes, this is, to the extent possible, what I'd like.

So, my suggestion is to keep all this relevant information as 'tags'
in metadata, and it is up to the backend to use or discard the
information. As you said, if the backend ignores it, there should be
no harm to the correctness of the final code.

The second point I wanted to make was on the intrinsics. I am not
sure why we need these intrinsics at the LLVM level, or why we would
need conditional constructs for expressing parallelism. These could
be calls made directly to the runtime library at the code-generation
level.

These are necessary because of technical requirements; specifically,
metadata variable references do not count as 'uses', so if runtime
expressions were not referenced by an intrinsic, those variables
would be deleted as dead code. In OpenMP, expressions which reference
local variables can appear in the pragmas (such as those which specify
the number of threads), and we need to make sure those expressions are
not removed prior to lowering. I believe that OpenACC has similar
clauses to support.
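
A hypothetical illustration of the 'uses' problem (the intrinsic name is invented): the value computed for a num_threads expression would be dead if it were referenced only from metadata, so an intrinsic call anchors it until lowering:

```llvm
; Invented intrinsic for illustration; metadata operands do not count
; as uses, so without this call DCE would delete %nt before lowering.
declare void @llvm.parallel.num.threads(i32)

define void @f(i32 %n) {
entry:
  %nt = shl i32 %n, 1   ; e.g. from "num_threads(n*2)" in the pragma
  call void @llvm.parallel.num.threads(i32 %nt)
  ; ... parallel region, described by metadata, lowered later ...
  ret void
}
```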

That having been said, I'm certainly open to more generic intrinsics.

Again, this is a very good initiative, and we would like to see such
support in LLVM ASAP.

I am very happy to hear you say that.

-Hal

Renato,

To some extent, I'm not sure that the keywords are the largest problem,
but rather the runtime libraries. OpenMP, OpenACC, Cilk++, etc. all
have runtime libraries that provide functions that interact with the
respective syntax extensions. Allowing for that in combination with a
generic framework might be difficult.

That having been said, from the implementation side, there are
certainly commonalities that we should exploit. Basic changes to
optimization passes (loop iteration-space changes, LICM, etc.), to
alias analysis, and so on will be necessary to support many kinds of
parallelism, and I think having generic support for that in LLVM is
highly preferable to specialized support for many different standards.
I think that there will be standard-specific semantics that will need
specific modeling, but we should share when possible (and I think that
a lot of the basic infrastructure can be shared).

-Hal