[RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Hello,

I'd like to see support in clang/LLVM for multi-core parallelism,
especially support for OpenMP. I think that the best way to do this is
by designing an LLVM-based API (metadata and intrinsics) for
expressing parallelism constructs, and having clang lower OpenMP code
to that API. This will allow maximal preservation of optimization
capabilities including target-specific lowering. What follows outlines
a set of metadata and intrinsics which should allow support for the
full OpenMP specification, and I'd like to know what the community
thinks about this.

As a general note: My intent here is to make the metadata safe in the
traditional sense: it can be removed by optimization passes that don't
understand it, and while this might result in the loss of the
parallelization, the removal will not be otherwise unsafe. I believe
that many existing passes will require minor modification in order to
preserve the metadata as appropriate, but I think these changes are
relatively small. In addition, the authors of passes that preserve
parallelization by dealing with parallelization metadata will need to
explicitly think about how to handle it; hopefully, this will yield
fewer bugs.

In the following I will outline the API and explain how OpenMP will be
lowered. My idea is to follow OpenMP's semantics, so if these differ
from the OpenMP spec, then I'd like to correct that. If there are other
parallelism models that we would like to support, then I think those
can be incorporated as well (maybe something with lightweight tasks
such as Cilk).

---- Parallel Regions ----

Inside a parallel region, a team of threads executes the sequence of
instructions.

A parallel region is specified by a function. This function may be
executed by one or more threads in parallel. In terms of OpenMP:
private() variables become variables local to the function.
firstprivate() variables become parameters to the function. shared()
variables become pass-by-pointer parameters. If the shared variable is
not a global, then we allocate a local copy, using alloca followed by a
store, and pass the new pointer to the function. For copyin()
variables, we pass a copy of the variable to the function, and the
function then uses that copy to update the thread's version of the
(TLS) variable. The function should have private (or internal) linkage
for optimization purposes.

To mark this function as a parallel region, a module-level 'parallel'
metadata entry is created. The call site(s) of this function are marked
with this metadata. The metadata has the following entries:
- The string "region"
- A reference to the parallel-region function
- If applicable, a list of metadata references specifying
special-handling child regions (parallel loops and serialized/critical
regions)

If the special-handling region metadata is no longer referenced by code
within the parallel region, then the region has become invalid, and
will be removed (meaning all parallelization metadata will be removed)
by the ParallelizationCleanup pass. The same is true for all other
cross-referenced metadata below.

Note that parallel regions can be nested.

As a quick example, something like:
int main() {
  int a;
#pragma omp parallel firstprivate(a)
  do_something(a);
  ...
}

becomes something like:

define private void @parreg(i32 %a) {
entry:
  call void @do_something(i32 %a)
  ret void
}

define i32 @main() {
entry:
...
call void @parreg(i32 %a) !parallel !0
...

!0 = metadata !{ metadata !"region", @parreg }
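
Similarly, a shared() variable might be lowered like this (a sketch
only; @parreg2, @use_shared and !1 are illustrative names, and the
alloca/store pair is only needed because the variable is not a global):

declare void @use_shared(i32*)

define private void @parreg2(i32* %a) {
entry:
  call void @use_shared(i32* %a)
  ret void
}

; in the caller:
  %a = alloca i32              ; local storage for the shared variable
  store i32 0, i32* %a
  call void @parreg2(i32* %a) !parallel !1

!1 = metadata !{ metadata !"region", @parreg2 }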

-- Reductions --

To handle reductions, first, the variable is converted into an output
pass-by-pointer parameter to the function. The pointer refers to an
array of values, one for each thread that will execute the region.
After the region completes, a loop must be created to actually perform
the requested reduction. Inside the parallel region, each thread
accesses its value using its thread id as the index. See the nthreads
and tidx intrinsics below.
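
A sketch of this lowering (using the nthreads and tidx intrinsics
described below; names are illustrative, and the final reduction loop
is ordinary serial IR that is not shown):

define private void @parreg.red(i32* %red) {
entry:
  %tid  = call i32 @llvm.parallel.tidx(metadata !0)
  %slot = getelementptr i32* %red, i32 %tid   ; this thread's element
  store i32 0, i32* %slot                     ; reduction identity
  ; ...the region body accumulates into %slot...
  ret void
}

; in the caller, before entering the region tagged by !0:
  %nt  = call i32 @llvm.parallel.nthreads(metadata !0)
  %red = alloca i32, i32 %nt                  ; one element per thread
  call void @parreg.red(i32* %red) !parallel !0
  ; ...followed by a serial loop that folds the %nt partial
  ; results back into the original variable.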

-- Special handling regions --

- Serial Regions -

Serial regions within parallel blocks (called 'single' in OpenMP) are
executed only by one thread. As with parallel regions themselves, they
are lowered as functions; the call site(s) of these functions are
tagged with 'parallel' metadata. This metadata has entries:
  - The string "serial"
  - A reference to the single-region function
  - A metadata reference to the parent parallel-region or loop metadata
  - Optionally, a type: "master" or "any" (the default)

For regions marked "master", only the master thread may execute the
region.
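
For example (names and metadata numbers are illustrative), a 'single'
region inside the parallel region from the earlier example might be
encoded as:

!0 = metadata !{ metadata !"region", @parreg, metadata !2 }
!2 = metadata !{ metadata !"serial", @serreg, metadata !0, metadata !"any" }

; inside @parreg:
  call void @serreg() !parallel !2

Note that !0 now lists !2 as a special-handling child and !2 points
back at !0; whether the children are listed directly or via a nested
list node is a detail I'm happy to adjust.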

- Critical Regions -

Critical regions are like serial regions, but they are executed by all
threads with mutual-exclusion. These are identified by 'parallel'
metadata with entries:
  - The string "critical"
  - A reference to the critical-region function
  - A metadata reference to the parent parallel-region, loop or task
metadata
  - Optionally, a global name string used for non-local synchronization
(all regions with the same name string are mutually exclusive)
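
A named critical region would look much the same (again illustrative):

!3 = metadata !{ metadata !"critical", @critreg, metadata !0, metadata !"update_lock" }

; inside @parreg:
  call void @critreg() !parallel !3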

- Loops -

Parallel loops are indicated by tagging all backedge branches with
'parallel' metadata. This metadata has the following entries:
  - The string "loop"
  - A metadata reference to the parent parallel-region metadata
  - Optionally, a string specifying the scheduling mode: "static",
"dynamic", "guided", "runtime", or "auto" (the default)
  - Optionally, an integer specifying the number of loop levels over
which to parallelize (the default is 1)
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the loop.

Note that what makes this metadata safe is the cross-referencing
between the parent region metadata, the loop metadata and the metadata
references on the instructions. If any of these are removed or become
inconsistent, then the whole parallel region must be removed. The
ParallelizationCleanup pass will check this prior to lowering.
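
As an illustration of this cross-referencing (all names and metadata
numbers below are made up for the example), a statically scheduled
parallel loop inside its own region function might look roughly like:

declare void @do_work(i32)

define private void @parreg.loop(i32 %n) {
entry:
  br label %loop                       ; assumes %n >= 1
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  call void @do_work(i32 %i)
  %i.next = add i32 %i, 1
  %c = icmp slt i32 %i.next, %n
  br i1 %c, label %loop, label %exit, !parallel !5   ; tagged backedge
exit:
  call void @llvm.parallel.barrier()   ; no 'nowait' clause on the loop
  ret void
}

!4 = metadata !{ metadata !"region", @parreg.loop, metadata !5 }
!5 = metadata !{ metadata !"loop", metadata !4, metadata !"static" }

The call site of @parreg.loop would be tagged with !parallel !4 as
before; if either !4 or !5 is dropped, or the backedge loses its
!parallel tag, the cleanup pass sees the inconsistency and strips the
whole region.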

To lower lastprivate() OpenMP variables, first we allocate a copy of
the variable outside the loop. At the end of the loop body we insert a
check to determine if the current iteration is the last one (over all
threads), and if so, we update the common copy with the local version.
Note that for OpenMP loops that have private, firstprivate, etc.
clauses that cannot be made part of the parent parallel region, these
loops will also need to be placed into their own functions to handle
the relevant scope issues.

Ordered regions (those which must execute in the original iteration
order) are lowered as functions, much in the same way as serial
regions. The call site(s) are tagged with 'parallel' metadata. This
metadata has entries:
  - The string "ordered"
  - A reference to the function specifying the ordered region
  - A metadata reference to the parent parallel loop

Serial regions and loops that don't have the 'nowait' OpenMP clause must
be followed by a barrier intrinsic.

- Tasks -

Explicit tasks are also lowered as functions similar to other special
handling regions. Their call site(s) are marked with 'parallel'
metadata. Depending on the implementation, they may not actually start
executing until the main thread executes a taskwait intrinsic or
reaches the end of the parallel region. The task metadata has:
  - The string "task"
  - A reference to the function specifying the task
  - A metadata reference to the parent region, task, loop, etc.
  - Optionally, an affinity mode: "untied" or "tied" (the default). In
tied mode, once a task starts executing in a particular thread, it must
continue to execute in that thread until completion. An untied task can
be passed in between threads.
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the task.
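
For example (illustrative names; the task function receives its
firstprivate values as parameters just like any other region function):

!6 = metadata !{ metadata !"task", @taskreg, metadata !0, metadata !"untied" }

; inside the parent region function:
  call void @taskreg(i32 %x) !parallel !6
  ...
  call void @llvm.parallel.taskwait()   ; where the OpenMP taskwait appears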

-- Intrinsics --

Because metadata does not count as a variable use, and some runtime
controls take general expressions, supporting these requires
intrinsics. Many of these intrinsics are tied to their parent parallel
regions by taking a metadata parameter specifying the parallel region,
loop, etc.

void @llvm.parallel.if(i1, !) - Takes a boolean expression controlling
whether the referenced region (or task) is executed in parallel (the
true case) or in serial (the false case). For a task, this controls the
choice between queued or immediate in-place execution.

void @llvm.parallel.final(i1, !) - Takes a boolean expression
controlling whether the referenced task is considered final. A final
task can have no subtasks (or, for that matter, nested parallel
regions).

void @llvm.parallel.setnt(i32, !) - Specify the number of threads used
to execute the parallel region.

i32 @llvm.parallel.nthreads(!) - Determine the total number of threads
that will be used to execute the referenced parallel region (this is
used to setup the array for reductions).

i32 @llvm.parallel.tidx(!) - Obtain the current thread index; this is
not the global thread id, or even the application-specific thread id.
These indices run only from 0 through one less than the total number of
threads active in the referenced region (this is used to access
elements in a reduction array).

void @llvm.parallel.chunksz(i32 or i64, !) - Specify the size of the
chunks used to decompose a parallel loop. The metadata reference is to
the metadata which tags the loop backedges.

void @llvm.parallel.barrier() - A barrier for all threads in the
current parallel region.

void @llvm.parallel.taskwait() - Wait for all child tasks of the
current task (or all top-level tasks).

void @llvm.parallel.taskyield() - Optionally yield execution to other
tasks.
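
Tying the intrinsics back to the metadata, an 'if' clause plus a
'num_threads' clause on the region from the first example might be
emitted roughly as follows (the prototypes are the proposed ones
written out explicitly; %m and %nt.req are illustrative values):

declare void @llvm.parallel.if(i1, metadata)
declare void @llvm.parallel.setnt(i32, metadata)

; in the caller, just before entering the region:
  %run.par = icmp sgt i32 %m, 64
  call void @llvm.parallel.if(i1 %run.par, metadata !0)
  call void @llvm.parallel.setnt(i32 %nt.req, metadata !0)
  call void @parreg(i32 %a) !parallel !0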

---- Parallel Sections ----

OpenMP parallel sections are lowered as parallel loops. The loop
executes a fixed number of times (once per section), and within the
loop body a switch statement selects the correct section (in order)
based on the iteration number.
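
A sketch for two sections (illustrative; !7 stands for parallel-loop
metadata of the form described in the Loops section above, attached to
the backedge):

declare void @section_a()
declare void @section_b()

define private void @sections.reg() {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  switch i32 %i, label %latch [ i32 0, label %sec.a
                                i32 1, label %sec.b ]
sec.a:
  call void @section_a()
  br label %latch
sec.b:
  call void @section_b()
  br label %latch
latch:
  %i.next = add i32 %i, 1
  %c = icmp slt i32 %i.next, 2
  br i1 %c, label %loop, label %exit, !parallel !7   ; parallel-loop metadata
exit:
  ret void
}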

---- Thread-Local Data ----

#pragma omp threadprivate(<variable-list>), which applies only to
global variables, is handled by declaring global variables with the
existing thread_local attribute.
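
For example, given a file-scope 'int x':

; int x;
; #pragma omp threadprivate(x)
@x = thread_local global i32 0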

---- Atomic Operations ----

OpenMP atomic operations are encoded using LLVM's existing atomic
instructions (e.g., atomicrmw and cmpxchg).

---- Flush ----

In general, an OpenMP flush operation, regardless of the contents of
the variable list, can be lowered as: fence seq_cst.
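
A sketch covering the two preceding sections (assuming a global
'int sum'; the exact memory ordering to use for '#pragma omp atomic' is
a separate discussion, and monotonic below is only a placeholder):

@sum = global i32 0

; #pragma omp atomic
;   sum += val;
  %old = atomicrmw add i32* @sum, i32 %val monotonic

; #pragma omp flush
  fence seq_cst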

---- Passes ----

-- Early Passes --

ParallelRegionWidening - This is an early pass that tries to combine
consecutive parallel regions. Non-parallel "in between" regions can be
converted into serialized blocks. This can be done so long as any
reductions can be delayed until the end of the last region, and any
converted serial regions do not have external function calls or inline
assembly regions (both of which could be sensitive to the real number
of active threads). This not only reduces thread-startup overhead, but
will also allow other optimizations, such as loop fusion.

-- Late Passes (Lowering) --

The parallelization lowering will be done by IR level passes in CodeGen
prior to SelectionDAG conversion. Currently, this means after
loop-strength reduction. Like loop-strength reduction, these IR level
passes will get a TLI object pointer and will have target-specific
override capabilities.

ParallelizationCleanup - This pass will be scheduled prior to the other
parallelization lowering passes (and anywhere else we decide). Its job
is to remove parallelization metadata that had been rendered
inconsistent by earlier optimization passes. When a parallelization
region is removed, any parallelization intrinsics that can be removed
are then also removed.

ParallelizationLowering - This pass will actually lower parallelization
constructs into a combination of runtime-library calls and, optionally,
target-specific intrinsics. I think that an initial generic
implementation will target libgomp.

* I would like to see support for OpenMP 3.1 [1] plus an extension for
  user-defined reductions (UDRs) [2].

[1] OpenMP Specification 3.1. July, 2011.
    http://www.openmp.org/mp-documents/OpenMP3.1.pdf

[2] A. Duran, et al. "A proposal for User-Defined Reductions in
OpenMP". IWOMP, 2010.
http://www.ccs.tsukuba.ac.jp/workshop/IWOMP2010/slides/Alex-udrs.pdf

Thanks again,
Hal

Hi,

> I'd like to see support in clang/LLVM for multi-core parallelism,
> especially support for OpenMP. I think that the best way to do this is
> by designing an LLVM-based API (metadata and intrinsics) for
> expressing parallelism constructs, and having clang lower OpenMP code
> to that API. This will allow maximal preservation of optimization
> capabilities including target-specific lowering. What follows outlines
> a set of metadata and intrinsics which should allow support for the
> full OpenMP specification, and I'd like to know what the community
> thinks about this.

Something like this would be useful also for OpenCL C
work group parallelization. At the moment in pocl we do this in a
hackish way with an "overkill" OpenCL C-specific metadata that is fed
to a modified bb-vectorizer of yours for autovectorization and
a custom alias analyzer for AA benefits.

I'd like to remind that multithreading is just one option on how
to map the "parallel regions/loops" in parallel programs to parallel
hardware. Within a single core, vectorization/DLP (SIMD/vector extensions)
and static ILP (basically VLIW) are the other interesting ones. In order
to exploit all the parallel resources one could try to intelligently
combine the mapping over all of those.

Also, one user of this metadata could be the alias analysis: it should
be easy to write an AA that can exploit the parallelism
information. Parallel regions by definition do not have (defined)
dependencies between each other (between synchronization points) which
should be useful information for optimization purposes even if
parallel hardware was not targeted.

- Loops -

Parallel loops are indicated by tagging all backedge branches with
'parallel' metadata. This metadata has the following entries:
   - The string "loop"
   - A metadata reference to the parent parallel-region metadata
   - Optionally, a string specifying the scheduling mode: "static",
"dynamic", "guided", "runtime", or "auto" (the default)
   - Optionally, an integer specifying the number of loop levels over
which to parallelize (the default is 1)
   - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the loop.

IMHO the generic metadata used to mark parallelism (basically to denote
independence of iterations in this case) should be separated from OpenMP-
specific ones such as the scheduling mode. After all, there are and will be
more parallel programming languages/standards in the future than just
OpenMP that could generate this new metadata and get the mapping to the
parallel hardware (via thread library calls or autovectorization, for
example) automagically.

-- Late Passes (Lowering) --

The parallelization lowering will be done by IR level passes in CodeGen
prior to SelectionDAG conversion. Currently, this means after
loop-strength reduction. Like loop-strength reduction, these IR level
passes will get a TLI object pointer and will have target-specific
override capabilities.

ParallelizationCleanup - This pass will be scheduled prior to the other
parallelization lowering passes (and anywhere else we decide). Its job
is to remove parallelization metadata that had been rendered
inconsistent by earlier optimization passes. When a parallelization
region is removed, any parallelization intrinsics that can be removed
are then also removed.

ParallelizationLowering - This pass will actually lower parallelization
constructs into a combination of runtime-library calls and, optionally,
target-specific intrinsics. I think that an initial generic
implementation will target libgomp.

A vectorization pass could trivially vectorize parallel loops
without calls etc. here.

BR,

Hi,

> I'd like to see support in clang/LLVM for multi-core parallelism,
> especially support for OpenMP. I think that the best way to do
> this is by designing an LLVM-based API (metadata and intrinsics)
> for expressing parallelism constructs, and having clang lower
> OpenMP code to that API. This will allow maximal preservation of
> optimization capabilities including target-specific lowering. What
> follows outlines a set of metadata and intrinsics which should
> allow support for the full OpenMP specification, and I'd like to
> know what the community thinks about this.

Something like this would be useful also for OpenCL C
work group parallelization. At the moment in pocl we do this

I had thought about uses for shared-memory OpenCL implementations, but
I don't know enough about the use cases to make a specific proposal. Is
your metadata documented anywhere?

in a
hackish way with an "overkill" OpenCL C-specific metadata that is fed
to a modified bb-vectorizer of yours for autovectorization and
a custom alias analyzer for AA benefits.

I'd like to remind that multithreading is just one option on how
to map the "parallel regions/loops" in parallel programs to parallel
hardware. Within a single core, vectorization/DLP (SIMD/vector
extensions) and static ILP (basically VLIW) are the other interesting
ones. In order to exploit all the parallel resources one could try to
intelligently combine the mapping over all of those.

I agree, and this is specifically why I don't want to support OpenMP by
lowering it into runtime calls in the frontend. I want to allow for
other optimizations (vectorization, etc.) in combination
with (or instead of) multi-threading. I think that my current proposal
allows for that.

Also, one user of this metadata could be the alias analysis: it should
be easy to write an AA that can exploit the parallelism
information. Parallel regions by definition do not have (defined)
dependencies between each other (between synchronization points) which
should be useful information for optimization purposes even if
parallel hardware was not targeted.

I really like this idea! -- and it sounds like you may already have
something like this in POCL?

> - Loops -
>
> Parallel loops are indicated by tagging all backedge branches with
> 'parallel' metadata. This metadata has the following entries:
> - The string "loop"
> - A metadata reference to the parent parallel-region metadata
> - Optionally, a string specifying the scheduling mode: "static",
> "dynamic", "guided", "runtime", or "auto" (the default)
> - Optionally, an integer specifying the number of loop levels
> over which to parallelize (the default is 1)
> - If applicable, a list of metadata references specifying
> ordered and serial/critical regions within the loop.

IMHO the generic metadata used to mark parallelism (basically to
denote independence of iterations in this case) should be separated
from OpenMP- specific ones such as the scheduling mode. After all,
there are and will be more parallel programming
languages/standards in the future than just OpenMP that could
generate this new metadata and get the mapping to the parallel
hardware (via thread library calls or autovectorization, for example)
automagically.

I think that making the metadata more modular sounds like a good idea.

Regarding having scheduling be separate, care is required to ensure
correctness. A large constraint on the design of a metadata API is that
different pieces of metadata can be independently dropped by
transformation passes, and that must be made safe w.r.t. the correctness
of the code. For example, if a user specified that an OpenMP loop is to
be parallelized with runtime scheduling, and a parallel loop is in fact
generated, then we need to be sure to honor the runtime scheduling mode.
I've tried to propose metadata with a sufficient amount of
cross-referencing so that dropping any piece of metadata will preserve
correctness (even if that means losing a parallel region).

> -- Late Passes (Lowering) --
>
> The parallelization lowering will be done by IR level passes in
> CodeGen prior to SelectionDAG conversion. Currently, this means
> after loop-strength reduction. Like loop-strength reduction, these
> IR level passes will get a TLI object pointer and will have
> target-specific override capabilities.
>
> ParallelizationCleanup - This pass will be scheduled prior to the
> other parallelization lowering passes (and anywhere else we
> decide). Its job is to remove parallelization metadata that had
> been rendered inconsistent by earlier optimization passes. When a
> parallelization region is removed, any parallelization intrinsics
> that can be removed are then also removed.
>
> ParallelizationLowering - This pass will actual lower paralleliztion
> constructs into a combination of runtime-library calls and,
> optionally, target-specific intrinsics. I think that an initial
> generic implementation will target libgomp.

A vectorization pass could trivially vectorize parallel loops
without calls etc. here.

I agree. I think that vectorization is best done earlier in the
optimization schedule. Vectorization, however, should appropriately
update loop metadata to allow for proper integration with
parallelization, etc. Lowering to runtime libraries (for
multi-threading in whatever form) should be done relatively late in
the process (because further higher-level optimizations are often not
possible after that point).

Thanks for your comments! Please feel free to propose specific metadata
forms and/or intrinsics to capture your ideas; then we can work on
combining them.

-Hal

I had thought about uses for shared-memory OpenCL implementations, but
I don't know enough about the use cases to make a specific proposal. Is
your metadata documented anywhere?

It is now a quick "brute force hack", that's why I got interested in your
proposal. We just wanted to communicate the OpenCL work item information
further down in the compiler as easily as possible and didn't have time
to beautify it.

Now all instructions of the "chained" OpenCL kernel instances
(work items) are annotated with their work item ID, their "parallel region
ID" (from which region between barriers the instruction originates from) and
a sequence ID. So, lots of metadata bloat.

These annotations allow finding the matching instructions later on to
vectorize multiple work items together by just combining the matching
instructions from the different WIs. The alias analyzer uses this
metadata to return NO_ALIAS for any memory access combination where
the accesses are from different work items within the same parallel
region (the specs say if they do alias, the results are undefined,
thus a programmer's fault).

With your annotations this hack could probably be cleaned up by using the
"parallel for loop" metadata which the vectorizer and/or "thread lib call
injector" (or the static instruction scheduler for a VLIW/TTA) can then
use to parallelize the kernel as desired.

I'd point out that its usefulness is not limited to the case where the
kernel execution device is a shared-memory multicore (or a multicore at
all). All non-SIMT targets require laying out the code for all the
work-items (as if they were parallel for loops, unrolled or vectorized
or not) for valid OpenCL kernel execution whenever there is more than
one WI per work-group, and thus could potentially benefit from this.

I agree, and this is specifically why I don't want to support OpenMP by
lowering it into runtime calls in the frontend. I want to allow for
other optimizations (vectorization, etc.) in combination
with (or instead of) multi-threading. I think that my current proposal
allows for that.

Yes it should, as far as I can see. If the loop body is a function and
the iteration count (or a multiple of it) is known, one should be able
to vectorize multiple copies of the function without dependence
checking. In the multi-WI OpenCL C case this function would contain the
code for a single work item within a region between barriers (implicit
or not).

I'm unsure if forcing the function extraction of the parallel
regions brings unnecessary problems or not. Another option would be to
mark the basic blocks that form parallel regions. Maybe all of the BBs
could be marked with a PR identifier MD? This would require BB
metadata (are they supported?).

Also, one user of this metadata could be the alias analysis: it should
be easy to write an AA that can exploit the parallelism
information. Parallel regions by definition do not have (defined)
dependencies between each other (between synchronization points) which
should be useful information for optimization purposes even if
parallel hardware was not targeted.

I really like this idea! -- and it sounds like you may already have
something like this in POCL?

Yes, an OpenCL AA that exploits the work-item independence and address
space independence. With your annotations there could be a generic
AA for the "independence information from parallelism metadata" part and
a separate OpenCL-specific AA for the rest.

Regarding having scheduling be separate, care is required to ensure
correctness. A large constraint on the design of a metadata API is that

OK, I see.

I suppose it's not a big deal to add the scheduling property. At
least if one (later) allows adding scheduling modes supported by other
standards than OpenMP as well. I.e., not modes like "static" but
"openmp31_static" or similar. For OpenCL work item loops the
scheduling mode could be "auto" or left empty.

I agree. I think that vectorization is best done earlier in the
optimization schedule. Vectorization, however, should appropriately
update loop metadata to allow for proper integration with
parallelization, etc. Lowering to runtime libraries (for
multi-threading in whatever form) should be done relatively late in
the process (because further higher-level optimizations are often not
possible after that point).

Yes, to enable automatic mixing of vectorization and threading from
the single (data parallel) kernel.

> I had thought about uses for shared-memory OpenCL implementations,
> but I don't know enough about the use cases to make a specific
> proposal. Is your metadata documented anywhere?

It is now a quick "brute force hack", that's why I got interested in
your proposal. We just wanted to communicate the OpenCL work item
information further down in the compiler as easily as possible and
didn't have time to beautify it.

Now all instructions of the "chained" OpenCL kernel instances
(work items) are annotated with their work item ID, their "parallel
region ID" (from which region between barriers the instruction
originates from) and a sequence ID. So, lots of metadata bloat.

These annotations allow finding the matching instructions later on to
vectorize multiple work items together by just combining the matching
instructions from the different WIs. The alias analyzer uses this
metadata to return NO_ALIAS for any memory access combination where
the accesses are from different work items within the same parallel
region (the specs say if they do alias, the results are undefined,
thus a programmer's fault).

With your annotations this hack could probably be cleaned up by using
the "parallel for loop" metadata which the vectorizer and/or "thread
lib call injector" (or the static instruction scheduler for a
VLIW/TTA) can then use to parallelize the kernel as desired.

I'd point out that its usefulness is not limited to the case where the
kernel execution device is a shared-memory multicore (or a multicore at
all). All non-SIMT targets require laying out the code for all the
work-items (as if they were parallel for loops, unrolled or vectorized
or not) for valid OpenCL kernel execution whenever there is more than
one WI per work-group, and thus could potentially benefit from this.

Fair enough. My thought process here was that, first, I was not going
to propose anything specifically for non-shared-memory systems (those
require data copying directives, and I'd want to let others who have
experience with those do the proposing), and second, I was not going to
propose anything specifically for multi-target (heterogeneous) systems.
I think that single-target shared-memory systems fall into the model
I've sketched, and support for anything else will require further
extension.

> I agree, and this is specifically why I don't want to support
> OpenMP by lowering it into runtime calls in the frontend. I want to
> allow for other optimizations (vectorization, etc.) in combination
> with (or instead of) multi-threading. I think that my current
> proposal allows for that.

Yes it should, as far as I can see. If the loop body is a function and
the iteration count (or a multiple of it) is known, one should be able
to vectorize multiple copies of the function without dependence
checking. In the multi-WI OpenCL C case this function would contain
the code for a single work item within a region between barriers
(implicit or not).

I'm unsure if forcing the function extraction of the parallel
regions brings unnecessary problems or not. Another option would be to
mark the basic blocks that form parallel regions. Maybe all of the BBs
could be marked with a PR identifier MD? This would require BB
metadata (are they supported?).

I thought about this. There had been some patches provided for BB
metadata (by Ralf Karrenberg back in May); I don't recall what happened
with those. BB metadata might work, but I worry about existing
optimization passes, which don't know about this metadata, moving
things in and out of parallel regions in illegal ways. For example,
moving a call to some get_number_of_threads() function, or some inline
assembly region, in or out of a parallel region. Putting things in
functions just seemed safer (and BB metadata is not upstream). Also, it
would require extra checking to keep the parallel basic blocks together.
Furthermore, in many cases, the parallel regions need to end up as
separate functions anyway (because they're passed as callbacks to the
runtime library).

>> Also, one user of this metadata could be the alias analysis: it
>> should be easy to write an AA that can exploit the parallelism
>> information. Parallel regions by definition do not have (defined)
>> dependencies between each other (between synchronization points)
>> which should be useful information for optimization purposes even
>> if parallel hardware was not targeted.
>
> I really like this idea! -- and it sounds like you may already have
> something like this in POCL?

Yes, an OpenCL AA that exploits the work-item independence and address
space independence. With your annotations there could be a generic
AA for the "independence information from parallelism metadata" part
and a separate OpenCL-specific AA for the rest.

> Regarding having scheduling be separate, care is required to ensure
> correctness. A large constraint on the design of a metadata API is
> that

OK, I see.

I suppose it's not a big deal to add the scheduling property. At
least if one (later) allows adding scheduling modes supported by other
standards than OpenMP as well. I.e., not modes like "static" but
"openmp31_static" or similar. For OpenCL work item loops the
scheduling mode could be "auto" or left empty.

I think that this makes sense. For some things, like 'static', we can
define backend-independent semantics. For other things, like OpenMP's
'runtime', which is tied to how the application calls OpenMP runtime
functions, I agree, we should probably call that 'openmp_runtime' (or
something like that).

> I agree. I think that vectorization is best done earlier in the
> optimization schedule. Vectorization, however, should appropriately
> update loop metadata to allow for proper integration with
> parallelization, etc. Lowering to runtime libraries (for
> multi-threading in whatever form) should be done relatively late in
> the process (because further higher-level optimizations are often
> not possible after that point).

Yes, to enable automatic mixing of vectorization and threading from
the single (data parallel) kernel.

Yep, that is exactly what I want to be able to do.

Thanks again,
Hal

Hi,

Sorry for the hiatus, busy time at my university. :)

After a false start and some (hopefully cogent) thought, I am now of
the opinion that it will be better to have llvm natively support a
somewhat different notion of parallel computation and have the
frontend lower OpenMP directives (and possibly other such things) into
the same.

In short, I propose an intrinsic-based approach which hinges on the
concept of a "parallel map". The immediate effect of using intrinsics
is that we no longer have to worry about missing metadata. Moreover,
we are still free to lower the intrinsics in a variety of ways --
including vectorizing them or lowering them to calls to an actual
openmp backend.

I have also tried to make this representation more orthogonal and
general; mirroring a significant subset of openmp directives inside
llvm's IR doesn't feel right. For instance, in the following
proposal, the OpenMP TASK directive is lowered using the more general
parallel_map construct. A pass lowering the intrinsics into an OpenMP
backend may "reverse engineer" the mentioned pattern into tasks when
possible, but, in principle, I think the directives we introduce into
llvm's IR are best left as mutually exclusive as possible. Keeping
the intrinsics simple and orthogonal should also help in asserting and
verifying correctness.

I plan to first implement a null lowering pass which simply lowers intrinsics
to something semantically correct but with no multithreading. Once that is
done, I'll try to lower the intrinsics into something more interesting, perhaps
libgomp or maybe even a custom runtime.

The proposal

How are you representing things like various scheduling mechanisms without metadata - extra parameters to intrinsics ?
- dibyendu

Hi,

How are you representing things like various scheduling mechanisms without metadata - extra parameters to intrinsics ?

You're right, there should be an extra parameter to parallel_map
indicating the scheduling policy.

Hi,

Sorry for the hiatus, busy time at my university. :)

After a false start and some (hopefully cogent) thought, I am now of
the opinion that it will be better to have llvm natively support a
somewhat different notion of parallel computation and have the
frontend lower OpenMP directives (and possibly other such things) into
the same.

In short, I propose an intrinsic-based approach which hinges on the
concept of a "parallel map". The immediate effect of using intrinsics
is that we no longer have to worry about missing metadata. Moreover,
we are still free to lower the intrinsics in a variety of ways --
including vectorizing them or lowering them to calls to an actual
openmp backend.

I have also tried to make this representation more orthogonal and
general; mirroring a significant subset of openmp directives inside
llvm's IR doesn't feel right. For instance, in the following
proposal, the OpenMP TASK directive is lowered using the more general
parallel_map construct. A pass lowering the intrinsics into an OpenMP
backend may "reverse engineer" the mentioned pattern into tasks when
possible, but, in principle, I think the directives we introduce into
llvm's IR are best left as mutually exclusive as possible. Keeping
the intrinsics simple and orthogonal should also help in asserting and
verifying correctness.

I plan to first implement a null lowering pass which simply lowers
intrinsics to something semantically correct but with no
multithreading. Once that is done, I'll try to lower the intrinsics
into something more interesting, perhaps libgomp or maybe even a
custom runtime.

The proposal
------------

I propose introducing four new intrinsics to llvm:

1. void @llvm.parallel_map(i32 limit, void (i32, i8 *) fn, i8* priv)

Semantics:

Executes `limit` copies of fn, _possibly_ in parallel. The map index
(i32, ranging from 0 to (limit - 1) both inclusive) and `priv` are
passed to each of the invocations. The only invariant is that the 0th
iteration is always executed in the invoking thread.

It is legal to have calls to parallel_map inside a function being
parallel_map'ed over.

2. void @llvm.sync_region(void (i32, i8 *) fn, i8 type)

Semantics:

It is only legal to call sync_region from within the dynamic extent of
a parallel_map. It ensures that the `limit` copies of fn (where
`limit` is taken from the enclosing parallel_map) are executed with
mutual exclusion.

`type` can either be 0 (`Any`) signifying that the synchronized
regions can be run in any order or 1 (`Ordered`) signifying that the
synchronized regions must be run in increasing order of the index.

3. i32 @llvm.get_num_threads()

Semantics:

Returns the number of threads in the thread pool.

4. i32 @llvm.set_num_threads(i32)

Set the number of threads in the thread pool.

It should be possible to lower all OpenMP directives to the above four
intrinsics in the frontend (please read this in conjunction with [1]):

Parallel regions can be lowered as a parallel_map with
@llvm.get_num_threads() as the limit.

#pragma PARALLEL
  block

desugars to

@llvm.parallel_map(num_threads, block_closure, shared_closure)
...

void block_closure(i32 tid, i8* shared_vars) {
 block
}

Reductions are handled by a parallel_map followed by a regular
reduction loop (exactly as in Hal's proposal).

Serial blocks reduce to a block conditional on the index inside the
function being parallelly mapped.

We lower critical and ordered regions into calls to `sync_region`.

Tasks are lowered (recursively) this way:

TASK
  block
more_code_may_contain_more_tasks
TASK_WAIT

desugars to

@llvm.parallel_map(2, task_closure, task_priv)

void task_closure(i32 index, i8* private) {
  if (index == 0) {
    more_code_may_contain_more_tasks
  } else {
    block
  }
}

Parallel loops are basically `parallel_map`s.
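
For example, a simple parallel for loop over n iterations could be
lowered as follows (a sketch; @loop_body and @body are illustrative
names, and the scheduling-policy parameter discussed above is omitted):

declare void @llvm.parallel_map(i32, void (i32, i8*)*, i8*)
declare void @body(i32)

  ; for (i = 0; i < n; i++) body(i);  (iterations independent)
  call void @llvm.parallel_map(i32 %n, void (i32, i8*)* @loop_body, i8* %priv)

define private void @loop_body(i32 %index, i8* %priv) {
entry:
  call void @body(i32 %index)
  ret void
}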

Thoughts?

Most of this sounds alright, but a few things worry me. My approach was
designed to require minimal changes to the rest of the infrastructure.
In your case you'll need to:

- Teach LoopInfo, ScalarEvolution, etc. how to 'see' the
   loops spanning the parallel_map calls. You'll need to teach LICM,
   etc. how to do their transformations in the presence of
   parallel_map. This may require that these passes be promoted to
   module-level passes, and have other undesirable consequences.

- Teach the inliner and other associated passes to
   understand the parallel_map intrinsic. Especially when using the
   null implementation (or an implementation that does not actually
   support task dispatch), small tasks should be considered for
   inlining.

In short, I think that your approach will require a lot more work to
get to a production-quality state than mine. Your approach has the
advantage of a simpler API, but I want to make sure that you've thought
through the various changes to the existing passes that will be
necessary.

Thanks again,
Hal

Hi,

Also, one user of this metadata could be the alias analysis: it should
be easy to write an AA that can exploit the parallelism
information. Parallel regions by definition do not have (defined)
dependencies between each other (between synchronization points) which
should be useful information for optimization purposes even if
parallel hardware was not targeted.

I'm a bit "off-topic" : this assumption seems strong to me : I don't think it is forbidden to have dependencies between two iterations of a parallel loop for instance. I agree it will not be deterministic but it is not necessarily an issue.

Best,

Mehdi Amini

Sanjoy Das <sanjoy@playingwithpointers.com> writes:

In short, I propose an intrinsic-based approach which hinges on the
concept of a "parallel map". The immediate effect of using intrinsics
is that we no longer have to worry about missing metadata. Moreover,
we are still free to lower the intrinsics in a variety of ways --
including vectorizing them or lowering them to calls to an actual
openmp backend.

I'll re-ask here since this is in its own thread.

Why can't we just make ordinary function calls to runtime routines?

                     -David

I agree. I can't imagine any practical way that a metadata-based approach could be preserved by optimizers.

-Chris

> Sanjoy Das <sanjoy@playingwithpointers.com> writes:
>
>> In short, I propose a intrinsic based approach which hinges on the
>> concept of a "parallel map". The immediate effect of using
>> intrinsics is that we no longer have to worry about missing
>> metadata. Moreover, we are still free to lower the intrinsics in
>> a variety of ways -- including vectorizing them or lowering them
>> to calls to an actual openmp backend.
>
> I'll re-ask here since this is in its own thread.
>
> Why can't we just make ordinary function calls to runtime routines?

I agree. I can't imagine any practical way that a metadata-based
approach could be preserved by optimizers.

Regarding the metadata approach, it depends on what you mean by
preserved. The trick is to make sure that transformations that don't
understand the metadata can't cause miscompiles. The specific scheme
that I proposed used a combination of procedurization and
cross-referencing metadata such that invalidated parallel metadata can
be detected and the entire enclosing parallel region can be dropped.

The proposal from Intel, which more-heavily uses intrinsics, has other
advantages, but will require more modifications to existing passes to
realize its potential optimization benefits.

-Hal

My comment was mostly in response to the Intel proposal, which effectively translates OpenMP pragmas directly into llvm intrinsics + metadata. I can't imagine a way to make this work *correctly* without massive changes to the optimizer.

-Chris

Yes, this. Absolutely.

-eric

>
>
>> Sanjoy Das <sanjoy@playingwithpointers.com> writes:
>>
>>> In short, I propose a intrinsic based approach which hinges on the
>>> concept of a "parallel map". The immediate effect of using
>>> intrinsics is that we no longer have to worry about missing
>>> metadata. Moreover, we are still free to lower the intrinsics in
>>> a variety of ways -- including vectorizing them or lowering them
>>> to calls to an actual openmp backend.
>>
>> I'll re-ask here since this is in its own thread.
>>
>> Why can't we just make ordinary function calls to runtime routines?
>
> I agree. I can't imagine any practical way that a metadata-based
> approach could be preserved by optimizers.
>

Yes, this. Absolutely.

I think, in that case, that both you (and Chris) are being somewhat
unimaginative. At this point, I believe that several workable proposals
have been put forward, and what we now need is detailed analysis and
review.

As I've stated, whether the metadata is preserved is not really the
relevant metric. It is fine for a pass that does not understand
parallelization metadata to drop it. The important part is that dropping
the metadata, and moving instructions to which that metadata is
attached, must not cause miscompiles. For example:

- Instructions with unknown side effects or dependencies must not be
   moved from outside a parallel region to inside a parallel region.
- Serialized subregions inside of parallel regions cannot be deleted
   without deleting the enclosing parallel region.

The outstanding proposals have ways of dealing with these things. In
the case of my proposal, it is through cross-referencing the metadata
sufficiently and using function boundaries to prevent unwanted code
motion. In Intel's case, it is by using the barriers implied by the
intrinsics calls.

-Hal

Hal Finkel <hfinkel@anl.gov> writes:

As I've stated, whether the metadata is preserved is not really the
relevant metric. It is fine for a pass that does not understand
parallelization metadata to drop it. The important part is that dropping
the metadata, and moving instructions to which that metadata is
attached, must not cause miscompiles. For example:

- Instructions with unknown side effects or dependencies must not be
   moved from outside a parallel region to inside a parallel region.
- Serialized subregions inside of parallel regions cannot be deleted
   without deleting the enclosing parallel region.

The outstanding proposals have ways of dealing with these things. In
the case of my proposal, it is through cross-referencing the metadata
sufficiently and using function boundaries to prevent unwanted code
motion. In Intel's case, it is by using the barriers implied by the
intrinsics calls.

These two paragraphs seem contradictory to me. How can a pass rely on
the metadata to not do illegal code motion if the pass has dropped the
metadata? I must be missing something important.

The only way I can think that this would work is that the explicit
outlining is already done so there is no way to move between
parallel/non-parallel without going all interprocedurally bonkers. :)

This is the kind of thing that worries me about these proposals.

                           -David

Hal Finkel <hfinkel@anl.gov> writes:

> As I've stated, whether the metadata is preserved is not really the
> relevant metric. It is fine for a pass that does not understand
> parallelization metadata to drop it. The important part is that
> dropping the metadata, and moving instructions to which that
> metadata is attached, must not cause miscompiles. For example:

> - Instructions with unknown side effects or dependencies must not
> be moved from outside a parallel region to inside a parallel region.
> - Serialized subregions inside of parallel regions cannot be
> deleted without deleting the enclosing parallel region.
>
> The outstanding proposals have ways of dealing with these things. In
> the case of my proposal, it is though cross-referencing the metadata
> sufficiently and using function boundaries to prevent unwanted code
> motion. In Intel's case, it is by using the barriers implied by the
> intrinsics calls.

These two paragraphs seem contradictory to me. How can a pass rely on
the metadata to not do illegal code motion if the pass has dropped the
metadata? I must be missing something important.

The only way I can think that this would work is that the explicit
outlining is already done so there is no way to move between
parallel/non-parallel without going all interprocedurally bonkers. :)

Yes, this is exactly what I mean. The metadata needs to be
appropriately cross-referenced, so that if any parallelization
metadata within some parallel region is dropped, then this can be
detected, and the entire parallel region can be dropped. The code
motion would be prevented by explicit outlining. The inliner would need
to be taught not to inline functions with parallelization
metadata (when non-trivial parallelization is enabled). That, however,
seems like a small and simple change.

In Intel's proposal, code motion is prevented because the
parallelization intrinsics serve as explicit scheduling barriers. We'd
need, I suppose, to enhance various passes to understand when they
could override the barrier and move code regardless (for optimization).

-Hal

> Sanjoy Das <sanjoy@playingwithpointers.com> writes:
>
>> In short, I propose a intrinsic based approach which hinges on the
>> concept of a "parallel map". The immediate effect of using
>> intrinsics is that we no longer have to worry about missing
>> metadata. Moreover, we are still free to lower the intrinsics in
>> a variety of ways -- including vectorizing them or lowering them
>> to calls to an actual openmp backend.
>
> I'll re-ask here since this is in its own thread.
>
> Why can't we just make ordinary function calls to runtime routines?

I agree. I can't imagine any practical way that a metadata-based
approach could be preserved by optimizers.

Do you think it would be better to turn these things into actual syntax
extensions? This would obviously require more infrastructure work, but
it looks like we have a sufficient number of interested parties to
implement the changes.

Then we'd still need to decide what the appropriate level of
abstraction should be at the IR level. Opinions?

Thanks again,
Hal

>
>>
>>
>>> Sanjoy Das <sanjoy@playingwithpointers.com> writes:
>>>
>>>> In short, I propose a intrinsic based approach which hinges on
>>>> the concept of a "parallel map". The immediate effect of using
>>>> intrinsics is that we no longer have to worry about missing
>>>> metadata. Moreover, we are still free to lower the intrinsics in
>>>> a variety of ways -- including vectorizing them or lowering them
>>>> to calls to an actual openmp backend.
>>>
>>> I'll re-ask here since this is in its own thread.
>>>
>>> Why can't we just make ordinary function calls to runtime
>>> routines?
>>
>> I agree. I can't imagine any practical way that a metadata-based
>> approach could be preserved by optimizers.
>
> Regarding the metadata approach, it depends on what you mean by
> preserved. The trick is to make sure that transformations that don't
> understand the metadata can't cause miscompiles. The specific scheme
> that I proposed used a combination of procedurization and
> cross-referencing metadata such that invalidated parallel metadata
> can be detected and the entire enclosing parallel region can be
> dropped.
>
> The proposal from Intel, which more-heavily uses intrinsics, has
> other advantages, but will require more modifications to existing
> passes to realize its potential optimization benefits.

My comment was mostly in response to the Intel proposal, which
effectively translates OpenMP pragmas directly into llvm intrinsics +
metadata. I can't imagine a way to make this work *correctly*
without massive changes to the optimizer.

Also, I should mention that Sanjoy's recommendation, which is to move
the parallelization state into an analysis pass, might make sense here.
If not all intermediate passes preserve the analysis, then the state
will be lost, and no parallelization will occur. In the context of
OpenMP, where parallelization is essentially optional, I think this
should be fine.

In any case, if we mark the intrinsics as having unknown side effects
then they'll serve as barriers for code motion. I *think* that this
would also inhibit loop restructuring (or could be made to do so) so
loop annotations could be kept properly associated with the intended
code, but this would need to be checked.

-Hal