LLVM Loop Vectorizer puzzle

Hi,
I used the LLVM loop vectorizer to compile the following sample:
//=================

int test(int *a, int n) {
  for (int i = 0; i < n; i++) {
    a[i] += i;
  }
  return 0;
}

//================
The corresponding .ll file has a loop preheader:
//================

for.body.lr.ph:                                   ; preds = %entry
  %cnt.cast = zext i32 %n to i64
  %0 = and i32 %n, 7
  %n.mod.vf = zext i32 %0 to i64
  %n.vec = sub i64 %cnt.cast, %n.mod.vf
  %cmp.zero = icmp eq i32 %0, %n
  br i1 %cmp.zero, label %middle.block, label %vector.body
//================

That is, if n <= 7, the program will skip the vector.body. In LoopVectorize.cpp, I see the following code:
//================

static cl::opt<unsigned>
TinyTripCountVectorThreshold("vectorizer-min-trip-count", cl::init(16), ... );
//================

The minimum trip count threshold is 16, so what does the "n <= 7" check mean? Thanks.

Hi,

The TinyTripCountVectorThreshold only applies to loops with a known (constant) trip count. If a loop has a trip count below this value, we don't attempt to vectorize the loop. The loop in your example has an unknown trip count.

Once we decide to vectorize a loop, we emit code to check whether we can execute one iteration of the vectorized body. This is the code you quoted.
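
Schematically, the structure the vectorizer emits looks roughly like this (a simplified sketch; the total vector width of 8 is implied by the "and i32 %n, 7", but the block names and the elided details are illustrative, not actual output):

for.body.lr.ph:          ; runtime check: is there at least one full vector
  ...                    ; iteration? (false when n < 8)
  br i1 %cmp.zero, label %middle.block, label %vector.body

vector.body:             ; handles 8 elements per vectorized iteration,
  ...                    ; covering n - (n % 8) scalar iterations in total
  br i1 %vec.done, label %middle.block, label %vector.body

middle.block:            ; did the vector loop already cover all n iterations?
  br i1 %all.done, label %for.end, label %for.body

for.body:                ; original scalar loop, now only executing
  ...                    ; the remaining n % 8 iterations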

Hi,

Just from personal interest, is there a canonical way in IR+metadata to express "This small constant trip-count loop is desired to be converted into a sequence of vector operations directly"? I.e., mapping a 4-element i32 loop into a linear sequence of <4 x i32> operations. Obviously this may not always be a win, but I'm just wondering if there's a way to communicate this intent and get around the vectorizer-min-trip-count in specially desired cases, or if I should decide to generate vectorized IR directly. (This is in code coming from a DSL which will implicitly insert annotations, not manually written loops.)

Cheers,

Dave


I think that the answer is: not currently. On the other hand, if the loop is small enough to get unrolled, then you can either use the BB vectorization pass or the SLP vectorization pass to vectorize it.
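
For instance (a hand-written sketch, not actual pass output), a constant 4-iteration version of the loop above, once fully unrolled, becomes straight-line code along these lines:

  %p0 = getelementptr inbounds i32* %a, i64 0
  %v0 = load i32* %p0, align 4
  %s0 = add nsw i32 %v0, 0
  store i32 %s0, i32* %p0, align 4
  %p1 = getelementptr inbounds i32* %a, i64 1
  %v1 = load i32* %p1, align 4
  %s1 = add nsw i32 %v1, 1
  store i32 %s1, i32* %p1, align 4
  ; ... and likewise for elements 2 and 3 ...

which the SLP vectorizer can then combine into a handful of vector operations, e.g.:

  %vp = bitcast i32* %a to <4 x i32>*
  %v  = load <4 x i32>* %vp, align 4
  %s  = add nsw <4 x i32> %v, <i32 0, i32 1, i32 2, i32 3>
  store <4 x i32> %s, <4 x i32>* %vp, align 4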

-Hal

Yes.

I would like us to grow a few annotations, among others, one to force vectorization irrespective of whether the loop vectorizer thinks it is beneficial or not - however, this is future music.

Isn't that part of the ivdep implementation? I thought there was support
for that already...

--renato

No, llvm.loop.parallel only communicates information about memory dependencies (or their absence), and the loop vectorizer only uses it for this. I don't think we should give it the additional semantics of forcing vectorization.

Of course, you could locally patch llvm to abuse it for other purposes...

(Note: I have not formed a strong opinion on this yet; these are just some initial thoughts. I am not convinced yet that the attributes below are the right set of attributes, or that the syntax is right. :wink:)

I am thinking of something like:

llvm.vectorization.<param><value>

which would allow us to communicate safety and optimization parameters from the front end:

- Safety:
#pragma vectorize [max_iterations <NUM>]

For vectorization we might want to have an optional parameter specifying up to which distance vectorization is safe:
#pragma vectorize max_iterations 8
would indicate that vectorization up to a distance of 8 is safe. This would restrict the combinations of VF and unroll factor the vectorizer is allowed to choose.

- Parameters controlling the vectorizer optimization choices:
width, unroll factor, force vectorization at Os, don’t vectorize

#pragma vectorize width 4 unroll 2
Forces VF=4 and unroll=2

#pragma vectorize max_iterations 8
Allows the vectorizer to choose.

#pragma vectorize off
Disable vectorization.

#pragma vectorize force
Force vectorization, irrespective of whether the vectorizer considers it beneficial.

If we decide that

#pragma ivdep

should imply forced vectorization (which I am not sure it should), the front end can then, in addition to the llvm.loop.parallel metadata, emit metadata to force vectorization. But I don't think we should overload the semantics of llvm.loop.parallel.
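
As a purely hypothetical sketch (neither the pragma nor these metadata names exist today), the front end could lower "#pragma vectorize width 4 unroll 2" into hints attached to the loop's backedge branch, for example:

  br i1 %exitcond, label %for.end, label %for.body, !llvm.vectorization !0

  ...

  !0 = metadata !{ metadata !1, metadata !2 }
  !1 = metadata !{ metadata !"llvm.vectorization.width", i32 4 }
  !2 = metadata !{ metadata !"llvm.vectorization.unroll", i32 2 }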

I'm not sure that ivdep should "force vectorization" either. My
interpretation of this pragma is that it tells the compiler not to consider
"assumed" dependencies. "Proven" dependencies are still valid, which can
prevent vectorization.

I apologize in advance if this seems nit-picky.

-Cameron

I would like us to grow a few annotations, among others, one to force vectorization irrespective of whether the loop vectorizer thinks it is beneficial or not - however, this is future music.

Isn't that part of the ivdep implementation? I thought there was support for that already...

No, llvm.loop.parallel only communicates information about memory dependencies (or their absence), and the loop vectorizer only uses it for this. I don't think we should give it the additional semantics of forcing vectorization.

Of course, you could locally patch llvm to abuse it for other purposes...

I was recently thinking about how to extend the parallel loop metadata to support other hints. Does it make sense to use a single loop id metadata and attach hints to it?

For example, here is a simple loop with llvm.loop.parallel and llvm.mem.parallel_loop_access metadata:

loop.body: ; preds = %loop.body, %loop.body.lr.ph
  %indvars.iv = phi i64 [ %4, %loop.body.lr.ph ], [ %indvars.iv.next, %loop.body ]
  %__index.addr.07 = phi i32 [ %__low, %loop.body.lr.ph ], [ %7, %loop.body ]
  %ref1 = load i32*** %3, align 8, !llvm.mem.parallel_loop_access !0
  %5 = load i32** %ref1, align 8, !llvm.mem.parallel_loop_access !0
  %arrayidx = getelementptr inbounds i32* %5, i64 %indvars.iv
  %6 = trunc i64 %indvars.iv to i32
  store i32 %6, i32* %arrayidx, align 4, !llvm.mem.parallel_loop_access !0
  %indvars.iv.next = add i64 %indvars.iv, 1
  %7 = add i32 %__index.addr.07, 1
  %exitcond = icmp eq i32 %7, %__high
  br i1 %exitcond, label %loop.end, label %loop.body, !llvm.loop.parallel !0

If I want to add metadata for the vector length, how should it look? One thing that would be nice is not having to check branches for different types of loop metadata. How about changing llvm.loop.parallel to llvm.loop and making the hints child nodes?

e.g.,

br i1 %exitcond, label %loop.end, label %loop.body, !llvm.loop !0

...

!0 = metadata !{ metadata !1, metadata !2 }
!1 = metadata !{ metadata !"llvm.loop.parallel" }
!2 = metadata !{ metadata !"llvm.vectorization.vector_width", i32 8 }

I'm not even sure you would need the llvm.loop.parallel anymore since the vectorizer could just look to see if the loop id on a parallel_loop_access matches the loop id of the loop being vectorized.

Does this make any sense?

If we decide that

#pragma ivdep

should imply forced vectorization (which I am not sure it should), the front end can then, in addition to the llvm.loop.parallel metadata, emit metadata to force vectorization. But I don't think we should overload the semantics of llvm.loop.parallel.

ivdep doesn't force vectorization. It just says that if you can't prove there is or isn't a dependency, then assume there isn't.

paul

!0 = metadata !{ metadata !1, metadata !2 }
!1 = metadata !{ metadata !"llvm.loop.parallel" }
!2 = metadata !{ metadata !"llvm.vectorization.vector_width", i32 8 }

I’m not even sure you would need the llvm.loop.parallel anymore since the vectorizer could just look to see if the loop id on a parallel_loop_access matches the loop id of the loop being vectorized.

Does this make any sense?

Yes. It makes sense to me.

If we decide that

#pragma ivdep

should imply forced vectorization (which I am not sure it should), the front end can then, in addition to the llvm.loop.parallel metadata, emit metadata to force vectorization. But I don't think we should overload the semantics of llvm.loop.parallel.

ivdep doesn't force vectorization. It just says that if you can't prove there is or isn't a dependency, then assume there isn't.

I think that we should come up with a better name. I am okay with providing ICC aliases, but I think that we should come up with slightly less cryptic names for clang.

In all fairness, I do not believe that ivdep is an ICC-specific pragma.
There are many compilers that support ivdep and lots of legacy (and modern)
codes that benefit from it. Seems silly, to me at least, to reinvent the
wheel.

If I'm not mistaken, ivdep predates ANSI C. Also if I'm not mistaken, ivdep
originated at Cray... way before vectors were cool. :wink:

-Cameron

In all fairness, I do not believe that ivdep is an ICC-specific pragma.
There are many compilers that support ivdep and lots of legacy (and modern)
codes that benefit from it. Seems silly, to me at least, to reinvent the
wheel.

I agree.

If I'm not mistaken, ivdep predates ANSI C. Also if I'm not mistaken, ivdep

originated at Cray... way before vectors were cool. :wink:

That's not right, Cray *made* vectors cool, quite literally actually. :wink:

cheers,
--renato

Yes. However, I think you still need to use the self-referencing metadata trick or similar to make the metadata identifying a loop unique, though (to avoid it being merged with other metadata nodes that have the same data). That is, e.g., the llvm.mem.parallel_loop_access has to refer to *the* original loop, not just any llvm.loop metadata with the same child metadata.

On dropping the llvm.loop.parallel metadata and relying only on checking the parallel_loop_access to identify parallel loops, I'm not so sure. Does it
retain all the info for all cases? Let's say you have a parallel loop without
memory accesses but, say, a volatile inline asm block. In that case you do not
have a way to communicate that the iterations in the said loop can be executed in any order if you cannot mark the loop itself parallel.

Regards,

Hi Cameron,

The history of the ivdep pragma is fascinating. I did not know that it predated ANSI. People who care about Cray compatibility should provide aliases for #ivdep. The name "ivdep" is simply terrible. There is no good reason not to come up with a syntax that is actually meaningful to people. I like Arnold's idea to add additional options to the 'vector' pragma.

Thanks,
Nadav

I'm not even sure you would need the llvm.loop.parallel anymore since the
vectorizer could just look to see if the loop id on a parallel_loop_access
matches the loop id of the loop being vectorized.

Does this make any sense?

Yes. However, I think you still need to use the self-referencing metadata trick or similar to make the metadata identifying a loop unique, though (to avoid it being merged with other metadata nodes that have the same data). That is, e.g., the llvm.mem.parallel_loop_access has to refer to *the* original loop, not just any llvm.loop metadata with the same child metadata.

On dropping the llvm.loop.parallel metadata and relying only on checking the parallel_loop_access to identify parallel loops, I'm not so sure. Does it
retain all the info for all cases? Let's say you have a parallel loop without
memory accesses but, say, a volatile inline asm block. In that case you do not
have a way to communicate that the iterations in the said loop can be executed in any order if you cannot mark the loop itself parallel.

In this case, can't you just add llvm.mem.parallel_loop_access to the "call void asm ..." instruction in the IR?

paul

I'm not even sure you would need the llvm.loop.parallel anymore since the
vectorizer could just look to see if the loop id on a parallel_loop_access
matches the loop id of the loop being vectorized.

Does this make any sense?

Yes. However, I think you still need to use the self-referencing metadata trick or similar to make the metadata identifying a loop unique, though (to avoid it being merged with other metadata nodes that have the same data). That is, e.g., the llvm.mem.parallel_loop_access has to refer to *the* original loop, not just any llvm.loop metadata with the same child metadata.

So it should look like:

!0 = metadata !{ metadata !0, metadata !1, metadata !2 }
!1 = metadata !{ metadata !"llvm.loop.parallel" }
!2 = metadata !{ metadata !"llvm.vectorization.vector_width", i32 8 }

Correct?

paul

Yep. And yes, I think one can just add the parallel_loop_access MD
also to volatile inline asm calls and other instructions which might
prevent parallelization. At least I do not quickly see a case that could
break.
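
To make that concrete, here is a sketch using the names proposed above (llvm.loop and the vector_width hint are still just proposals): a parallel loop whose only potentially ordering instruction is a volatile asm block could be marked like this:

loop.body:
  ...
  call void asm sideeffect "nop", ""(), !llvm.mem.parallel_loop_access !0
  ...
  br i1 %exitcond, label %loop.end, label %loop.body, !llvm.loop !0

!0 = metadata !{ metadata !0, metadata !1, metadata !2 }   ; self-referencing loop id
!1 = metadata !{ metadata !"llvm.loop.parallel" }
!2 = metadata !{ metadata !"llvm.vectorization.vector_width", i32 8 }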

In all fairness, I do not believe that ivdep is an ICC-specific pragma.
There are many compilers that support ivdep and lots of legacy (and modern)
codes that benefit from it. Seems silly, to me at least, to reinvent the
wheel.

Hi Cameron,

The history of the ivdep pragma is fascinating. I did not know that it
predated ANSI. People who care about Cray compatibility should provide
aliases for #ivdep. The name "ivdep" is simply terrible. There is no good
reason not to come up with a syntax that is actually meaningful to people.
I like Arnold's idea to add additional options to the 'vector' pragma.

Clang is (still) way behind the curve when it comes to expressing
parallelism. Most other compilers support this out of the box, and there is
a large body of code that uses such pragmas, be they "nice" or "cryptic".
It's unfortunate that there is no ANSI or equivalent standard for ivdep,
but the compiler manuals, combined with large bodies of existing code,
provide a de-facto community standard.

If you are looking for a real standard to follow -- and that would be much
better than defining a clang-specific extension -- then I suggest OpenMP's
"#pragma omp simd". See e.g. the Stack Overflow question "Parallel for vs
omp simd: when to use each?", which discusses multi-threading vs.
vectorization, or the material on the Intel Developer Zone.

-erik

I don't disagree. :wink:

I think we should try to make #ivdep as close as possible to whatever the
original (or currently most used) meaning is, while enhancing and extending
the notion, where #ivdep can simply be a special case.

--renato