Thank you for the reply!
> As you may know, this is the third such proposal over the past two
> months, one by me
> and the other, based somewhat on mine, by Sanjoy
Yes, I was aware of your proposal. I hesitated to make any comments or
criticism -- as I am, obviously, biased.
In my opinion, the two most important differences between our proposals are:
1) Your design employs explicit procedurization done in front-end,
while our design allows both early (right after front-end) and late
(later in the back-end) procedurization.
2) You aim to provide general support for all (or at least most)
parallel standards, while our aim is more modest -- just OpenMP.
To be fair, my proposal was also fairly OpenMP specific.
Please see the discussion of 1) in the "Function Outlining" section of our
proposal.
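For readers following the thread, here is a rough C sketch of what procedurization (function outlining) means in practice. The names `runtime_fork` and `parallel_region_body` are invented for illustration, and the "runtime" below runs the workers serially so that the sketch stays self-contained; a real front end would emit a call into its OpenMP runtime instead:

```c
#include <assert.h>

/* Hypothetical runtime entry point: invokes 'body' once per worker.
 * A real OpenMP runtime would spawn or reuse threads; here the workers
 * run serially so the sketch stays self-contained. */
typedef void (*region_fn)(int thread_id, void *shared);

static void runtime_fork(region_fn body, int num_threads, void *shared) {
    for (int t = 0; t < num_threads; ++t)
        body(t, shared);
}

/* What the front end might outline from:
 *     #pragma omp parallel
 *     { partial[tid] = tid * tid; }
 */
static void parallel_region_body(int tid, void *shared) {
    int *partial = (int *)shared;
    partial[tid] = tid * tid;
}

int run_region(int num_threads, int *partial) {
    /* The pragma is replaced by a call into the runtime, passing the
     * outlined body and the captured shared state. */
    runtime_fork(parallel_region_body, num_threads, partial);
    int sum = 0;
    for (int i = 0; i < num_threads; ++i)
        sum += partial[i];
    return sum;
}
```

Doing this in the front end fixes the region boundaries early; doing it late, as our design also allows, keeps the region inside one function for longer so mid-level passes can still see across it.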
As for 2), there are many arguments one might use in favor of more
general or more specialized solution. What is easier to implement?
I feel that my proposal is easier to implement because it is safer:
because of the procedurization and the cross-referencing of the
metadata, passes that don't know about the parallelization metadata and
drop it will cause parallel regions to be lost, but should not
otherwise lead to miscompiled code, and with inlining, most
optimization opportunities are preserved.
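The safety property described above -- annotations acting as advisory hints, so that a pass dropping them loses parallelism but not correctness -- can be sketched in C. The `metadata_present` flag below is a stand-in for the cross-referenced metadata, not real LLVM machinery:

```c
#include <assert.h>

/* Advisory flag standing in for parallelization metadata: when a pass
 * that doesn't understand the metadata drops it, we merely fall back
 * to the serial schedule; the computed result is unchanged. */
int sum_squares(const int *a, int n, int metadata_present) {
    int sum = 0;
    if (metadata_present) {
        /* "Parallel" schedule: two interleaved strides, standing in
         * for two workers. Still deterministic for this reduction. */
        for (int t = 0; t < 2; ++t)
            for (int i = t; i < n; i += 2)
                sum += a[i] * a[i];
    } else {
        /* Serial fallback after the metadata is dropped. */
        for (int i = 0; i < n; ++i)
            sum += a[i] * a[i];
    }
    return sum;
}
```

Either path yields the same answer, which is the sense in which dropped metadata degrades performance rather than correctness.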
I agree that your proposal allows more optimization opportunities to be
preserved. On the other hand, it will require more auditing of existing
code, and new infrastructure just to make sure the new intrinsics don't
interfere with existing optimizations. I trust that you have sufficient
resources to do these things, and that being the case, I don't object.
What is better for LLVM IR development? Are we sure that what we see as
necessary and sufficient today would be suitable for future parallel
standards, given all the developments happening in this area as we
speak?

I guarantee that the answer is no -- but there are a number of
current standards that can be considered.

Whatever one answers, it would be quite subjective. My personal
preference is for the simplest and most focused solution -- but then
again, this is subjective.
> In order for your proposal to work well, there will be a lot of
> infrastructure work required (more than with my proposal); many
> passes will need to be made explicitly aware of how they can, or
> can't, reorder things with respect to the parallelization
> intrinsics; loop restructuring may require special care, etc. How
> this is done depends in part on where the state information is
> stored: Do we keep the parallelization information in the
> intrinsics during mid-level optimization, or do we move its state
> into an analysis pass? In any case, I don't object to this approach
> so long as we have a good plan for how this work will be done.
No -- only passes that happen before procedurization should be aware
of these intrinsics.
This answer is fairly ambiguous because you haven't explained exactly
when this will happen. I assume that it will happen fairly late. For
some things, like atomics lowering, we may want to wait until just
prior to code generation to allow late customization by target-specific
passes.
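To illustrate why atomics lowering benefits from being late and target-specific, here is a hedged C11 sketch of two interchangeable lowerings of the same atomic update; which one is best depends on the target's instruction set, which is exactly what isn't known until near code generation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Two lowerings of the same '#pragma omp atomic' update 'x += v'. */

/* Targets with a native atomic add can use it directly: */
int atomic_add_native(_Atomic int *x, int v) {
    return atomic_fetch_add(x, v);   /* returns the previous value */
}

/* Targets without one fall back to a compare-exchange loop: */
int atomic_add_cas(_Atomic int *x, int v) {
    int old = atomic_load(x);
    while (!atomic_compare_exchange_weak(x, &old, old + v))
        ;  /* 'old' is refreshed on failure; retry with the new value */
    return old;                      /* previous value, as above */
}
```

Both functions have identical semantics, so the choice between them can safely be deferred to a target-specific lowering step.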
I agree that it is not so easy to make optimizations "thread-aware".
But the problem is essentially the same, no matter how the parallel
extension is manifested in the IR.
> When we discussed this earlier this year, there seemed to be some
> consensus that we wanted to avoid, to the extent possible,
> introducing OpenMP-specific intrinsics into the LLVM layer. Rather,
> we should define some kind of parallelization API (in the form of
> metadata, intrinsics, etc.) onto which OpenMP can naturally map
> along with other paradigms. There is interest in supporting
> OpenACC, for example, which will require data copying clauses, and
> it would make sense to share as much of the infrastructure as
> possible with OpenMP. Are you interested in providing Cilk support
> as well? We probably don't want to have NxM slightly-different ways
> of expressing 'this is a parallel region'. There are obviously
> cases in which things need to be specific to the interface (like
> runtime loop scheduling in OpenMP which implies a specific
> interaction with the runtime library), but such cases may be the
> exception rather than the rule.
> We don't need 'omp' in the intrinsic names and also 'OMP_' on all of
> the string specifiers. Maybe, to my previous point, we could call
> the intrinsics 'parallel' and use 'OMP_' only when something is
> really OpenMP-specific?
As I said before, our aim was quite simple -- OpenMP support only.
Fair enough, but that does not explain why, even with a restricted
scope, we need to repeat 'omp' in both the intrinsic name and its
string specifiers.
As far as I can tell, what you've proposed is a fairly generic way to
pass pragma-type information from the frontend to the backend. Going
through all of the effort to implement that only to arbitrarily
restrict it to OpenMP pragmas seems silly. Having this capability would
be great, and we could use it for other things. For example, I'd like
to have a '#pragma unroll(n)' for loops. If we have a generic way to
pass such contextual pragmas to the backend, it would make supporting
such extensions much easier.
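As a concrete illustration of what a '#pragma unroll(4)' request would mean, here is a hand-written C equivalent of the transformation the backend would perform (the pragma spelling is hypothetical here and shown only as a comment):

```c
#include <assert.h>

/* The loop a user would annotate: */
int sum_original(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)   /* candidate for: #pragma unroll(4) */
        s += a[i];
    return s;
}

/* What honoring unroll(4) would produce: four copies of the body per
 * iteration of the main loop, plus an epilogue for the remainder. */
int sum_unrolled_by_4(const int *a, int n) {
    int s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {  /* unrolled main loop */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; ++i)            /* remainder (epilogue) loop */
        s += a[i];
    return s;
}
```

The point is that the pragma carries only a hint from the frontend; the backend performs the mechanical rewrite, which is why a generic pragma-passing channel would make such extensions easy to add.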
Can the design be extended to allow a more general form of parallel
extension support? Probably... but this is definitely more than what
we aimed for.
> You don't seem to want to map thread-private variables onto the
> existing TLS support. Why?
Because we don't employ explicit procedurization. What happens after
procedurization (including how thread-private variables are manifested
in the IR) is heavily dependent on the OpenMP runtime library one
relies upon, and is out of the scope of our proposal.
I thought that thread-private variables in OpenMP could be declared
only at global scope. This makes them map cleanly onto the existing TLS
support, and I don't see how the intrinsics will work in this case
(because you can't call intrinsics at global scope). That having been
said, I recommend that we introduce a new 'omp' TLS mode so that the
implementation is free to choose the most-appropriate lowering.
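As a sketch of why the mapping is natural: a file-scope threadprivate variable behaves much like a C11 `_Thread_local`, as in this self-contained example (pthreads is used only for the demo; a dedicated 'omp' TLS mode would be free to pick the lowering underneath):

```c
#include <assert.h>
#include <pthread.h>

/* A file-scope 'threadprivate' variable mapped onto existing TLS
 * support: each thread gets an independently initialized copy.
 * Conceptually:  #pragma omp threadprivate(counter)  */
static _Thread_local int counter = 100;

static void *worker(void *out) {
    counter += 1;              /* mutates only this thread's copy */
    *(int *)out = counter;
    return 0;
}

int tls_demo(void) {
    int seen = 0;
    pthread_t t;
    pthread_create(&t, 0, worker, &seen);
    pthread_join(t, 0);
    /* The worker saw its own fresh copy (101); ours is untouched. */
    return seen * 1000 + counter;
}
```

The worker observes its own freshly initialized copy while the main thread's copy is unchanged, which is exactly the semantics OpenMP threadprivate requires of a global-scope variable.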