Adding "simd" pragma to Clang

Hi All,

Continuing (and forking :-)) the discussion started by Renato (http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-February/035162.html), what is the community’s opinion on introducing pragma simd support in clang?

One possibility is to commit “#pragma omp simd” implementation, which is a part of OpenMP 4 standard (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, section 2.8). I see two downsides here: a) “omp” prefix (which might be confusing to some) and b) necessity to add -fopenmp to enable support (this also automatically links OpenMP RTL, which has nothing to do with this particular pragma).

Another possibility is to drop omp prefix – this removes the downsides mentioned above (and yes, makes the pragma similar to what is proposed in CilkPlus – not necessarily a bad thing in itself).

IMHO, both alternatives provide a useful, standards-based tool to control vectorization in clang / llvm compiler – while also advancing clang’s compatibility with existing standards.
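To make the two spellings concrete, here is a minimal sketch in C. The function name `saxpy` is made up for illustration and is not taken from the Intel patches. The guard on `_OPENMP` (which an OpenMP 4.0 compiler defines as 201307 when OpenMP support is switched on) is one way to see downside (b): without OpenMP enabled, the pragma is simply not compiled in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] += a * x[i] for i in [0, n). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* Spelling 1: the OpenMP 4.0 form. It only takes effect when the
       compiler has OpenMP support enabled, hence the guard. */
#if defined(_OPENMP) && _OPENMP >= 201307
#pragma omp simd
#endif
    /* Spelling 2 (Cilk Plus style) would be a bare "#pragma simd" on this
       loop; it is left as a comment because support for it is exactly what
       this thread is discussing. */
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Either way the loop computes the same result; the pragma only tells the compiler that vectorizing it is safe and wanted.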

We (Intel) can contribute all required code – implementation is ready, just needs some massaging before submission to clang trunk.

Opinions?

Yours,
Andrey

One possibility is to commit "#pragma omp simd" implementation, which is a
part of OpenMP 4 standard

Hi Andrey,

I see no problems with it. This is not the same as what I proposed
back then, but really, it's interesting to have anyway. The problems
you list come from trying to use omp pragmas as standard vectorization
pragmas, which is not ideal. But if you treat them as just extended
support for OMP, why not?

Another possibility is to drop omp prefix -- this removes the downsides
mentioned above (and yes, makes the pragma similar to what is proposed in
CilkPlus -- not necessarily a bad thing in itself).

Cilk might be closer to what we had in mind. The pragmas I've seen in Cilk:

#pragma simd
#pragma vector always

These could enable the three kinds of metadata we already have in IR, so if
you guys could introduce that, it'd be great!

cheers,
--renato

Personally, I'd love to see this as an "experimental" extension to OMP and work towards getting it in the standard. Having a functional and open source implementation which gets used in the real world makes for a strong argument for incorporating it in future releases. The goals are complementary to OMP, and long term it's simpler to have all similar pragmas under one domain/family, "omp".

Hi All,

Continuing (and forking :-)) the discussion started by Renato (http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-February/035162.html), what is the community’s opinion on introducing pragma simd support in clang?

One possibility is to commit “#pragma omp simd” implementation, which is a part of OpenMP 4 standard (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, section 2.8). I see two downsides here: a) “omp” prefix (which might be confusing to some) and b) necessity to add -fopenmp to enable support (this also automatically links OpenMP RTL, which has nothing to do with this particular pragma).

Andrey, thanks for continuing the discussion and the work on Renato’s proposal. I’d be happy to help with hooking up the vectorizer to the OpenMP front end implementation. This work will certainly improve our OMP implementation.

Another possibility is to drop omp prefix – this removes the downsides mentioned above (and yes, makes the pragma similar to what is proposed in CilkPlus – not necessarily a bad thing in itself).

IMHO, both alternatives provide a useful, standards-based tool to control vectorization in clang / llvm compiler – while also advancing clang’s compatibility with existing standards.

Personally, I am interested in providing a small set of pragmas/attributes, like the ones that Arnold and Renato proposed, that will allow low-level programmers to control the vectorization of loops. The vectorization pragmas that I envision are similar to Intel’s and Cray’s #pragma ivdep/simd. AFAIK it has been many decades since the introduction of the ivdep pragmas in Cray’s compiler, and I think that this is a good opportunity to add a new syntax that will fit modern C++ and modern vectorization techniques. If we like, we can provide aliases to the old terms.

We (Intel) can contribute all required code – implementation is ready, just needs some massaging before submission to clang trunk.

This sounds excellent! Can you provide more details on which pragmas are already implemented?

Thanks,
Nadav

Nadav,

The reaction seems to be positive so far.

OK, we (meaning someone from Intel) are going to prepare and commit support for #pragma omp – using the metadata that Renato introduced.

Andrey

Actually, Nadav, Arnold and only later on, me. ;)

Thanks!
--renato

Ouch, a typo – this

From: "Andrey Bokhanko" <andreybokhanko@gmail.com>
To: "Nadav Rotem" <nrotem@apple.com>
Cc: "cfe-dev" <cfe-dev@cs.uiuc.edu>, "Renato Golin" <renato.golin@linaro.org>, "Hal Finkel" <hfinkel@anl.gov>,
"Alexey Bataev" <a.bataev@gmx.com>, "Douglas Gregor" <dgregor@apple.com>, "Chris Lattner" <clattner@apple.com>,
"Michael Wong" <fraggamuffin@gmail.com>, "Arnold Schwaighofer" <aschwaighofer@apple.com>
Sent: Thursday, February 13, 2014 1:20:16 PM
Subject: Re: Adding "simd" pragma to Clang

Nadav, one more thing:

The vectorization pragmas that I envision are similar to Intel’s and
Cray's #pragma ivdep/simd.

If you are interested in pragma ivdep, we can commit it as well (we
have an implementation internally; it hasn't been committed to
github yet -- as it is not a part of OpenMP).

Please let me know if you want to see it supported.

Are the semantics of your ivdep the same as the simd pragma? Generally speaking, I'm supportive. As I recall, the last time we discussed this, there were real questions by some about what ivdep meant.

-Hal

Current ivdep implementation sets llvm.mem.parallel_loop_access for each
memory instruction in the loop. This can be used by both vectorizer and
other optimizations as well.

simd implementation [will] set vectorizer-specific metadata (force
vectorization, vector width, etc) in addition to parallel_loop_access.

Andrey

From: "Andrey Bokhanko" <andreybokhanko@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "cfe-dev" <cfe-dev@cs.uiuc.edu>, "Renato Golin" <renato.golin@linaro.org>, "Alexey Bataev" <a.bataev@gmx.com>,
"Douglas Gregor" <dgregor@apple.com>, "Chris Lattner" <clattner@apple.com>, "Michael Wong" <fraggamuffin@gmail.com>,
"Arnold Schwaighofer" <aschwaighofer@apple.com>, "Nadav Rotem" <nrotem@apple.com>
Sent: Friday, February 14, 2014 3:22:15 AM
Subject: Re: Adding "simd" pragma to Clang

Are the semantics of your ivdep the same as the simd pragma?
Generally speaking, I'm supportive. As I recall, the last time we
discussed this, there were real questions by some about what ivdep
meant.

-Hal

Current ivdep implementation sets llvm.mem.parallel_loop_access for
each memory instruction in the loop. This can be used by both
vectorizer and other optimizations as well.

simd implementation [will] set vectorizer-specific metadata (force
vectorization, vector width, etc) in addition to
parallel_loop_access.

Okay, so it sounds like, by default, the answer is that there is no difference. The simd pragma, however, has other options (like specifying the width) that can also be used. This sounds good, but let me be more specific about the concern that had been raised:

The Intel documentation (http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/cref_cls/common/cppref_pragma_ivdep.htm) states that, with ivdep, "Note: The proven dependencies that prevent vectorization are not ignored, only assumed dependencies are ignored." And also, one of the examples points out, "The following loop requires the parallel option in addition to the ivdep pragma to indicate there is no loop-carried dependencies." So, from this, there are two questions:

1. Does our "llvm.mem.parallel_loop_access" metadata represent the implied semantics, which seem to cover "vector dependencies" but not loop-carried dependencies?

2. What if there are dependencies that the Intel compiler "proves" but we only "assume"? In this case, we might vectorize (or, more problematic, use the metadata for other purposes, instruction scheduling for instance) in cases where the Intel compiler will ignore the directive because of some dependence it proves.

Personally, I'm less concerned about (2), because it seems silly, at best, to rely on the compiler to ignore your directives by realizing you must have made a mistake. (1) may be more of an issue.
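The difference between an "assumed" and a "proven" dependence can be sketched with two small loops. The function names are hypothetical, and the pragma is only mentioned in comments because its exact semantics are what is being debated here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed dependence: from this function alone the compiler cannot tell
   whether dst and src overlap, so it has to assume they might. An ivdep-style
   hint on this loop would be the programmer's promise that they do not. */
void add_one(size_t n, int *dst, const int *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] + 1;
}

/* Proven dependence: iteration i reads the value iteration i-1 just wrote.
   No hint can make independent execution of the iterations correct. */
void running_sum(size_t n, int *a)
{
    assert(a != NULL);
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}
```

Calling `add_one(4, buf + 1, buf)` on a single buffer makes the assumed dependence real, which is why a compiler that cannot see the call sites has to stay conservative unless it is told otherwise.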

Thoughts?

-Hal

IIRC the end result from the previous discussion was that,
as it's not in general possible to know which are "compiler proven dependencies"
(and which are only "assumed" due to, e.g., not having a smart enough AA),
we might as well assume ivdep to have the "embarrassingly parallel
loop" semantics the llvm.mem.parallel_loop_access was originally intended
to denote. Seems Intel came to the same conclusion?

As ivdep semantics are so fuzzy, I'd discourage its usage in favor of the other
better specified pragmas, but we might provide it just for the legacy
Intel-optimized codes.

BR,

I agree, plus we can detach the ivdep discussion from the main one.
Would be good to implement it as well, as close as possible to ICC's
semantics, but that is orthogonal to the current discussion.

I think pragmas simd, omp and omp simd are worth implementing as
they're well defined in other contexts. Regarding our own new ones,
pragma Clang or pragma optimize could be easily implemented locally
until we find a better way of doing that.

cheers,
--renato

Folks,

Just a follow-up. I've been discussing this issue on the GCC list, and
it seems that the general feeling is the same: we should avoid old
Intel or custom made pragmas and stick to Cilk/OMP 4 ones. For the
internal ones (unroll, enable) we can use the old Intel ones
(unroll/nounroll, vector/novector), so at least GCC and ICC would be
able to compile the same code.

Andrey,

The ones we could already plug in to existing metadata are:

Internal ones:
#pragma vector always == llvm.vectorizer.enable 1
#pragma novector == llvm.vectorizer.enable 0
ex: test/Transforms/LoopVectorize/X86/metadata-enable.ll

#pragma unroll N == llvm.vectorizer.unroll N
#pragma nounroll == llvm.vectorizer.unroll 1
ex: test/Transforms/LoopVectorize/metadata-unroll.ll

Cilk:
#pragma simd vectorlength N == llvm.vectorizer.width N
ex: test/Transforms/LoopVectorize/metadata-width.ll
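As a rough illustration of how these pragmas would sit in source code, here is a sketch in C. The function names are invented, the parenthesised forms `unroll(4)` and `vectorlength(4)` follow ICC's spelling as I understand it, and the pragmas are guarded by `__INTEL_COMPILER` because Clang support for them is what is being proposed, not something that exists yet. The metadata named in each comment is simply the mapping from the list above.

```c
#include <assert.h>
#include <stddef.h>

/* "#pragma vector always": would map to llvm.vectorizer.enable 1. */
void scale(size_t n, float *v, float k)
{
    assert(v != NULL);
#if defined(__INTEL_COMPILER)
#pragma vector always
#endif
    for (size_t i = 0; i < n; ++i)
        v[i] *= k;
}

/* "#pragma unroll(4)": would map to llvm.vectorizer.unroll 4. */
long sum(size_t n, const int *v)
{
    assert(v != NULL);
    long total = 0;
#if defined(__INTEL_COMPILER)
#pragma unroll(4)
#endif
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* "#pragma simd vectorlength(4)": would map to llvm.vectorizer.width 4. */
void add_arrays(size_t n, float *out, const float *a, const float *b)
{
    assert(out != NULL && a != NULL && b != NULL);
#if defined(__INTEL_COMPILER)
#pragma simd vectorlength(4)
#endif
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

With any compiler that does not define `__INTEL_COMPILER` these are plain loops, so the results are the same either way; the pragmas only steer the optimizer.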

cheers,
--renato

What about trying to extend OMP pragma to cleanly fit your goals? This would give the OMP community something to evaluate and possibly adopt in a future revision of the standard. Have you considered this at all? Further, the OMP community seems headed in that direction and gets input from Intel and other companies/stakeholders who really care about and know vectorization.

Cilk is far from any standard, not used in the real world(???) and fairly tied to Intel. (I'm not opposed to this, but something to think about)

What about trying to extend OMP pragma to cleanly fit your goals?

So, for this first implementation, I'm trying to steer away from
re-inventing the wheel. GCC does implement some of the Cilk/OMP4, and
will continue in that direction, so at least, even if it's old-school,
we'll get similar behaviour.

Cilk is far from any standard, not used in the real world(???) and fairly
tied to Intel. (I'm not opposed to this, but something to think about)

GCC implements some of them.

There's no reason why we can't extend OMP to deal with whatever we
think we should, but that effort is very long term, and it would be good
to have something working right now that wouldn't be Clang-specific. I
don't see how that's not good.

Whatever we do in the long term should not hold up implementing a
user-visible way to use our current metadata, even if temporary. But
we all know what "temporary" means in software, so I wouldn't like to
invent a "temporary hack" before OMP gets all we need (which I still
don't know what it is, completely). Using old common pragmas seems to
me the cleanest way of doing this.

cheers,
--renato

I'm not a big fan of using gcc as a reference to determine future goals.

I also see no future or large adoption of Cilk in the real world. AFAIK Intel contributed it, but I don't know who is actually using it and if the performance is good enough (gcc implementation) to actually offset the efforts involved.

All,

No objections to pragma simd so far, so we decided to go forward with submitting patches – starting with “#pragma omp simd” and then adding “#pragma simd” after that.

First patch is ready for review: http://llvm-reviews.chandlerc.com/D2815. As always, reviewers are very much welcome!

Yours,
Andrey

I'm not a big fan of using gcc as a reference to determine future goals.

I'm not sure you're getting the point. I'm not using GCC as reference
for anything.

I'm just saying that some pragmas already exist, people already use them
and the semantics map perfectly to our current needs. They will *not*
go away just because we don't implement them.

New OMP pragma discussions should *still* happen and replace the
current, deprecated ones. But that's completely orthogonal to any
legacy feature being implemented as well.

Some clean extension to OMP doesn't have to be long term
1) Your proposal could be implemented now and just used as a POC (proof of
concept) for the working group

1st Law of Compiler Extensions:

Any un-standardized extension will be abused and will outlive its
authors and the standard.

2) It would show clear deficiencies in the standard that have real world
value/usage

We can still do that independently.

By playing nice with OMP now

We can still do that independently.

This would also possibly help avoid duplicate efforts long term, users
having to migrate from one set of pragma to another and bringing people with
interest in solving the problem to help clearly define the behavior.

You're assuming our extension will be accepted as is. Unless your idea
involves discussing them with a large standardization body (OpenMP)
and other compilers (ICC, GCC), I don't think you have great chances
of doing so.

Either un-standardised extensions or deprecated pragmas will have the
same effect, with the difference that deprecated pragmas already
exist.

I don't think the high risk of not getting it perfectly right from the
beginning is worth the trouble of not implementing existing legacy
pragmas.

Lastly - how would this non-omp set of pragma end up playing nice with omp?

The same way it already does in ICC and GCC. And it should be solved in
the same way both ICC and GCC will solve it when OpenMP 4, 5, 6... come
out.

Specifically in LLVM, the OMP variants will use the same metadata, so
in theory, they should be absolutely identical, other than syntax.

cheers,
--renato