[LLVM] (RFC) Addition/Support of new Vectorization Pragmas in LLVM

The ivdep pragma is designed to do exactly what the name states - ignore
vector dependencies. Cray Research first implemented it in 1978 in
its CFT compiler and has supported it ever since.

This pragma is typically used by application developers who want
vectorized code when the compiler cannot automatically determine safety;
it is not equivalent to the OpenMP SIMD pragma in that the compiler is
still expected to automatically detect such things as reductions.

The Cray implementation accepts an optional 'safevl=<const>' clause;
however, this is not commonly used and is not really needed for new
implementations.

Characteristics of ivdep:

* ivdep can be applied to any loop in a nest, but only affects the
immediately following loop
* Only trims dependencies at that loop nest level
* If the user annotates more than one loop in a loop nest, only
the ivdep on the innermost loop or loops is honored

* ivdep can be used on for (), while (), and do { } while loops

* Primarily ivdep allows ambiguous dependencies to be ignored, examples:
* p[i] = q[j]
* a[ix[i]] = b[iy[i]]
* a[ix[i]] += 1.0
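To make this concrete, here is a minimal C sketch (the `#pragma GCC ivdep` spelling is GCC's; the function and variable names are illustrative, not part of any proposal):

```c
#include <assert.h>
#include <stddef.h>

/* The dependence between p[ix[i]] and q[iy[i]] is ambiguous: whether
 * any store overlaps any load depends on values unknown at compile
 * time.  The ivdep annotation asserts that no such overlap occurs,
 * so the loop may be vectorized without runtime checks. */
void copy_indexed(double *p, const double *q, const int *ix,
                  const int *iy, size_t n) {
#pragma GCC ivdep
    for (size_t i = 0; i < n; ++i)
        p[ix[i]] = q[iy[i]];
}
```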

* ivdep still requires automatic detection of reductions, including
multiple homogeneous reductions on a single variable, examples:
* x = x + a[i]
* x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }
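A sketch of the reduction case above, with two homogeneous '+' reductions into one scalar (function name is mine); even under ivdep the compiler is expected to break this cycle itself, e.g. via partial sums:

```c
#include <assert.h>
#include <stddef.h>

/* x carries two homogeneous (+) reductions; ivdep does not excuse the
 * compiler from recognizing them before vectorizing. */
double sum_conditional(const double *a, const double *b,
                       const double *c, size_t n) {
    double x = 0.0;
    for (size_t i = 0; i < n; ++i) {
        x = x + a[i];                  /* first reduction into x  */
        if (c[i] > 0.0)
            x = x + b[i];              /* second reduction into x */
    }
    return x;
}
```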

* ivdep implies loop control variables, loop termination expressions,
and primary induction increment expressions do not alias anything
else in the loop nest body

* For Fortran, ivdep allows the following for array syntax
* Assumption that there is no overlap between the target and right
hand side in assignments
* Assumption there is no overlap between the target and the
address expression for the target
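The loop-control bullet above can be illustrated with a small C sketch (names are mine): under ivdep the compiler may assume the store to `a[i]` never modifies `*n`, so the bound can be loaded once before the loop rather than re-read each iteration.

```c
#include <assert.h>

/* ivdep implies the store to a[i] cannot alias *n (the termination
 * expression), permitting the bound to be hoisted out of the loop. */
void scale_upto(double *a, const int *n, double s) {
    for (int i = 0; i < *n; ++i)
        a[i] = a[i] * s;
}
```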

Things that ivdep will /not/ do:

* ivdep on a loop nest will not diagnose or reject loops with
'provable' dependencies

* ivdep on a loop nest will not restructure loops as necessary for
correctness
* will not reorder statements for dependence resolution
* will not peel for dependence resolution
* will not perform loop distribution to remove recurrences

* ivdep does not force vector code to be generated; the compiler can
decide to not vectorize the loop nest for any reason it sees fit,
including but not limited to:
* Expectations the scalar version will be faster than the vector
* Non-vectorizable function calls in the loop nest
* Vector operations that lack hardware support
* Extremely unstructured control flow
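As a concrete instance of the first "will not" bullet, consider a first-order recurrence (this example is mine, not from the proposal): the dependence is provable, so annotating the loop with ivdep is a user error, and per the semantics above the compiler is not required to diagnose it.

```c
#include <assert.h>
#include <stddef.h>

/* Iteration i reads the value iteration i-1 wrote.  ivdep will not
 * reject, restructure, or peel this loop; vectorizing it naively
 * would miscompile. */
void prefix_sum(double *a, size_t n) {
    for (size_t i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}
```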

Hi Terry,

I'm curious.

* Primarily ivdep allows ambiguous dependencies to be ignored, examples:
* p[i] = q[j]
* a[ix[i]] = b[iy[i]]
* a[ix[i]] += 1.0

* ivdep still requires automatic detection of reductions, including
multiple homogeneous reductions on a single variable, examples:
* x = x + a[i]
* x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }

How do you define the difference between
  a[ix[i]] += 1.0
and
  x += 1.0
as you require reduction detection for the latter but seem to ignore the
(histogram) reduction dependences for the former.

Thanks,
  Johannes

I think some of the semantics could be implemented using the
"llvm.mem.parallel_loop_access" annotation we already have, modulo the
difficulties mentioned below.

   * Primarily ivdep allows ambiguous dependencies to be ignored, examples:
       * p[i] = q[j]
       * a[ix[i]] = b[iy[i]]
       * a[ix[i]] += 1.0

"ambiguous dependencies" is very vague. Does it mean the compiler has
to do some analysis to detect non-ambiguous dependencies?

When using "llvm.mem.parallel_loop_access", this would mean the
front-end would have to detect which accesses are non-ambiguous and
not annotate them. However, the annotation is for single accesses, not
dependencies. Both "p[i]" and "q[j]" look non-ambiguous individually,
but the vectorizer would have to add a runtime-check and loop
versioning to ensure that these are not aliasing.
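The point can be made concrete in plain C rather than IR metadata (example is mine): whether the loop carries a dependence is a property of the *pair* of pointers, not of either access alone, which is exactly what a per-access annotation cannot express.

```c
#include <assert.h>
#include <stddef.h>

/* With disjoint p and q every iteration is independent; with
 * q == p - 1 each iteration reads the value the previous iteration
 * wrote, so a vectorized copy would produce different results. */
void shift_copy(double *p, const double *q, size_t n) {
    for (size_t i = 0; i < n; ++i)
        p[i] = q[i];
}
```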

   * ivdep still requires automatic detection of reductions, including
     multiple homogeneous reductions on a single variable, examples:
       * x = x + a[i]
       * x = x + a[i]; if ( c[i] > 0.0 ) { x = x + b[i] }

We could leave out the "llvm.mem.parallel_loop_access" annotation on the
LoadInst and StoreInst of the reduction variable, assuming detected
reductions are limited to scalar variables. However, mem2reg/sroa
would remove those memory accesses anyway, including their annotation,
requiring the LoopVectorizer to detect that the resulting PHINode is a
reduction. Mem2reg/sroa/LICM would do the same for non-reductions and
for array elements that are promoted to registers for the duration of
the loop, in which case the loop would not be vectorizable.

Michael

The ivdep pragma is intended to assist automatic vectorization -
basically, automatic vectorization behaves as it normally does, but if
it gets into a spot where it finds a potential dependence, it continues
on rather than rejecting the loop.

Reductions are part of cycle-breaking; one possible way to identify
potential reduction objects is that their address is provably invariant
with respect to the vectorized loop. Some examples (assume 'i' is the
loop primary induction):

x = x + b[i]
&x is invariant with respect to the 'i' vector loop, and can be a
reduction candidate

a[0] = a[0] + b[i]
&a[0] is invariant with respect to the 'i' vector loop, and can be a
reduction candidate

a[ix[i]] = a[ix[i]] + b[i]
&a[ix[i]] varies with respect to the 'i' vector loop, and is not a
reduction candidate
* I am ignoring the edge case here where all values of ix[i] are
identical
* Without an ivdep, this would be considered a histogram (or
what Cray used to call 'vector update'), due to possible repeated values
in array 'ix'
* With an ivdep, this becomes a gather, load and add followed by
a scatter
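A small simulation may make the contrast concrete; `histogram_ivdep_sim` below is a hypothetical model of a width-`vl` gather/add/scatter, not actual vectorizer output:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar semantics: repeated values in ix accumulate into a. */
void histogram_scalar(double *a, const int *ix, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[ix[i]] = a[ix[i]] + 1.0;
}

/* Hypothetical model of what ivdep permits: gather all lanes, add,
 * then scatter.  With duplicate indices inside one chunk, later lanes
 * overwrite earlier ones and updates are lost. */
void histogram_ivdep_sim(double *a, const int *ix, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        size_t w = (n - i < vl) ? n - i : vl;
        double tmp[64];                           /* gathered lanes, vl <= 64 */
        for (size_t l = 0; l < w; ++l)
            tmp[l] = a[ix[i + l]] + 1.0;          /* gather + add */
        for (size_t l = 0; l < w; ++l)
            a[ix[i + l]] = tmp[l];                /* scatter */
    }
}
```

With all-identical indices the two disagree, which is exactly the repeated-value hazard described above.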

When outer loop vectorization is considered, identifying vector
reductions becomes somewhat more complicated, and the simple invariant
address test is not always sufficient. Examples on request, though that
is really a different (and possibly lengthy) discussion.

I realize this is probably out of scope, but I wanted to put this out
here in case people consider adding ivdep.


The ignored case is what I was asking about. Basically, I was wondering
how you determine when reduction dependences need to be resolved by the
vectorizer and when potential reduction dependences are ignored.

While I can see the appeal to users, I dislike a criterion like
"their address is provably invariant". What is provably invariant can
change depending on the surrounding code, the analyses, and the
transformations applied. Say ix is, after some transformations, known
to be a uniform array; reduction handling then becomes required, but if
the uniformity is not exposed before the vectorizer runs, the potential
dependences will be ignored, right? I mean, if it worked for the user
and then something unrelated (in the user code or in the compiler)
prevented the replacement of the ix[i] values with constants, that
would be hard to debug.

Do you see what I mean? Please let me know if I misunderstand how Cray
handles ivdep though.

Cheers,
  Johannes

My intent was to present the expected behavior for an "ivdep" pragma
implementation, rather than diving into the implementation details -
that seems like it should be another thread.

That said, trying to predict in the front end what edges will eventually
cause difficulties with automatic vectorization does seem problematic.
Generally "ivdep" is an assist to automatic vectorization; for older
Cray compilers that basically means the front end does nothing but pass
along the "ivdep" property, and dependency analysis for vectorization
uses that property directly.

One thing to remember is that it is perfectly valid for the "ivdep" loop
nest to still be rejected as a vector candidate for any reason, so
support for an "ivdep" pragma could be implemented in stages if desired.

Terry

That said, trying to predict in the front end what edges will eventually
cause difficulties with automatic vectorization does seem problematic.
Generally "ivdep" is an assist to automatic vectorization; for older
Cray compilers that basically means the front end does nothing but pass
along the "ivdep" property, and dependency analysis for vectorization
uses that property directly.

This is what makes implementing ivdep with Cray's semantics difficult.
To be compatible, we'd need to replicate Cray's cycle breaking.
Missing a reduction that Cray detects means ignoring its dependency
cycle, and therefore miscompiling a program that Cray's vectorizer
would have compiled correctly (and the other way around).
Unpredictably miscompiling programs is probably not what users expect.

One thing to remember is that it is perfectly valid for the "ivdep" loop
nest to still be rejected as a vector candidate for any reason, so
support for an "ivdep" pragma could be implemented in stages if desired.

The vectorizer rejecting any "ivdep" loop that has unbroken dependency
cycles makes the annotation useless. We'd need a description of the
dependencies that any Cray compiler (including past and future
versions) will ignore (instead of breaking by, e.g., reduction
detection) under ivdep, such that Clang never miscompiles a loop that a
Cray compiler compiles correctly.

Michael