Why doesn't pragma vectorize warn by default on failure

Right now when writing code like:

#pragma clang loop vectorize(enable) interleave(enable)
while(…) {

}

one has to:

  1. pass the compiler the flags “-Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize” to get a warning on all loops where vectorization failed.
  2. grep that output (which might be huge in a moderately sized project) for that particular loop.
  3. Check if vectorization worked.
  4. Remove the compiler flags and recompile to keep working on something else.
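
As a concrete illustration of the steps above, here is a minimal sketch of a loop one might want to check this way. The function name scale_add and the file name in the comment are made up for illustration; the flags are the ones listed in step 1.

```cpp
#include <cstddef>

// Hypothetical example function (not taken from a real project).
// To learn what the vectorizer did with this single loop, the whole
// translation unit has to be compiled with the remark flags, e.g.:
//   clang++ -O3 -c scale_add.cpp -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize
// and the output then searched for the line of the `for` below.
void scale_add(float* out, const float* in, float k, std::size_t n) {
#pragma clang loop vectorize(enable) interleave(enable)
  for (std::size_t i = 0; i < n; ++i) {
    out[i] += k * in[i];
  }
}
```

The pragma only affects how the loop is compiled; the function computes the same result whether or not vectorization succeeds, which is exactly why a failure is easy to miss.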

This is really bad.

When a user marks a loop this way, the intent is that the loop gets vectorized. However, the current workflow discourages users from actually checking that vectorization worked. Furthermore, if months later something in that loop changes in a way that breaks vectorization, this happens silently.

Why are things this way? Can something be done about this?

Could we get these warnings on by default for those loops explicitly marked with the vectorize pragma?

From: "Gonzalo BG" <gonzalobg88@gmail.com>
To: "cfe-dev@cs.uiuc.edu Developers" <cfe-dev@cs.uiuc.edu>
Sent: Wednesday, July 22, 2015 11:29:28 AM
Subject: [cfe-dev] Why doesn't pragma vectorize warn by default on failure

I agree, this seems like a nice idea. For loops that have the pragma, we should enable the pass "missed" informational messages from the vectorizer (unroller, etc.) by default. Patches are obviously welcome -- and/or, please file a bug report at https://llvm.org/bugs/

-Hal

Hi Gonzalo,

Warnings are currently generated when a loop is illegal to vectorize and vectorize(enable) or interleave(enable) is specified. For example,

#pragma clang loop vectorize(enable)
for (int i = 0; i < N; i++) {
  if (array[i] > 0) break;
  else array[i] = 0;
}

clang -O3 code.c
warning: loop not vectorized: failed explicitly specified loop vectorization [-Wpass-failed]

See http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html for more info.
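
A self-contained variant of the loop above, as a sketch. The function name clear_until_positive and its return value are made up for illustration. Promoting the warning with -Werror=pass-failed is a suggestion that follows from the warning group named in the output above, not something the example itself requires.

```cpp
// Hypothetical wrapper around the loop from the example above.
// The early `break` makes the trip count depend on the data, which is what
// prevents vectorization here. Compiling with optimizations, e.g.
//   clang++ -O3 -Werror=pass-failed -c code.cpp
// should turn the -Wpass-failed warning for this loop into a hard error.
int clear_until_positive(int* array, int n) {
  int i = 0;
#pragma clang loop vectorize(enable)
  for (; i < n; ++i) {
    if (array[i] > 0) break;
    array[i] = 0;
  }
  return i;  // index of the first positive element, or n if there is none
}
```

Making the warning an error is one way to keep a later change to the loop from silently losing vectorization, at the cost of failing the build whenever the vectorizer gives up.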

It's possible there is a bug in the code for generating the warning, or we missed generating it somewhere. Could you provide the loop body?

When the loop is not beneficial to vectorize and vectorize(enable) is specified, it will be vectorized anyway with a width of 2.

Actually, I’m a little worried that this could be resulting in inefficient code because people are specifying vectorize(enable) when in fact they just want the fastest code according to the cost-model. Perhaps we should let the user specify vectorize without an option ‘#pragma loop vectorize’ to say ‘warn me if the structure of the loop cannot be vectorized otherwise select the fastest code according to the cost-model’.
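
To make the distinction concrete, here is a sketch of the choices a user has today. The function names are made up. vectorize(enable) and vectorize_width(4) are existing options of #pragma clang loop; the option-less form proposed above does not exist, so it is not shown.

```cpp
// (a) Forced: vectorize(enable) asks for vectorization even when the cost
//     model considers it unprofitable.
int sum_forced(const int* a, int n) {
  int s = 0;
#pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i) s += a[i];
  return s;
}

// (b) No hint: the cost model decides whether and how to vectorize, but
//     nothing is reported by default if the loop cannot be vectorized.
int sum_cost_model(const int* a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) s += a[i];
  return s;
}

// (c) Explicit width: the user picks the vector width instead of the
//     cost model.
int sum_width4(const int* a, int n) {
  int s = 0;
#pragma clang loop vectorize_width(4)
  for (int i = 0; i < n; ++i) s += a[i];
  return s;
}
```

What is missing is a fourth variant that behaves like (b) for code generation but still warns like (a) when the loop structure rules out vectorization.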

Tyler

Perhaps we should let the user specify vectorize without an option
‘#pragma loop vectorize’ to say ‘warn me if the structure of the loop
cannot be vectorized otherwise select the fastest code according to the
cost-model’.

This sounds like a great idea.

I would also like to get a full diagnostic about why vectorization failed.
That is, have -Rpass-missed=loop-vectorize and
-Rpass-analysis=loop-vectorize enabled by default.

It's possible there is a bug in the code for generating the warning or we
missed generating it somewhere. Could you provide the loop body?

I provide a minimal example below (in Appendix A). It seems the diagnostic
triggers or not depending on the optimization level selected, which makes
sense now that I looked more into it. For example:

When compiling with -O0, -O2, -O3, and -Ofast I get no diagnostic, and in fact
the loop is not vectorized because it is actually eliminated.
When compiling with -O1 I get, by default and without passing any extra flags,
the diagnostic I wanted:

file.cpp:13:1: error: loop not vectorized: failed explicitly specified loop
      vectorization [-Werror,-Wpass-failed]
}
^
1 error generated.

It is also not hard to get "no diagnostic" even for loops that could be
vectorized (godbolt+assembly: Compiler Explorer). In that example, the
code is vectorized for a size of 100. But changing the size to 8 removes
the vectorization "silently". This is good and makes sense. For a size of 8
it is probably not worth it to vectorize the loop, and whatever cost model
the vectorizer uses knows this.
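
The linked example is not reproduced here, so the following is only a hypothetical stand-in for it: the same kind of loop with a compile-time trip count N, instantiated for 100 and for 8. The names sum_fixed, sum100 and sum8 are made up.

```cpp
// Hypothetical stand-in for the linked example: one loop, with the trip
// count N fixed at compile time.
template <int N>
int sum_fixed(const int (&a)[N], int start) {
  int count = start;
#pragma clang loop vectorize(enable)
  for (int i = 0; i != N; ++i) count += a[i];
  return count;
}

// Two instantiations whose generated assembly can be compared side by side.
// Adding -Rpass=loop-vectorize to the command line reports which of them,
// if any, the vectorizer transformed.
int sum100(const int (&a)[100], int start) { return sum_fixed(a, start); }
int sum8(const int (&a)[8], int start) { return sum_fixed(a, start); }
```

Both instantiations come from the same source line, so any difference in the output is due to the trip count alone.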

Thanks all of you for your help. It is greatly appreciated.

Appendix A: source code

#include <iostream>

constexpr int indirection(int *__restrict a, int i) { return a[i]; }

constexpr int foo(int start) {
  int count = start;
  int a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
#pragma clang loop vectorize(enable)
  for (int i = 0; i != 8; ++i) {
    count += indirection(a, i);
  }
  return count;
}

int main() {
  int start = 0;
  std::cin >> start;
  std::cout << foo(start);
  return 0;
}

I would also like to get a full diagnostic about why vectorization failed. That is, have -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize enabled by default.

That’s a good idea, I think it would solve the problem you are seeing. We should enable the remarks for code with a pragma hint. Thinking about the implementation, I can’t say off the top of my head how to pass that information to the BackendConsumer where remarks are emitted. If I get time I’ll do this, but I’m pretty swamped right now.

It's an interesting example. When SIZE=8 the loop unroller removes the loop. Actually, the vectorizer doesn’t even see it! That makes it difficult to generate a diagnostic in the loop vectorizer. And there are many passes that can remove a loop entirely before it gets to the vectorizer, causing the “no diagnostic” problem. Some of these passes generate a remark. For example, -Rpass=unroll says the loop in your example was “completely unrolled loop with 8 iterations”. We should extend the diagnostics to all passes that could remove a loop, such as the LoopDelete pass. In combination with your suggestion above, that should resolve the “no diagnostics” problem.
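
A sketch of how one might check which pass got to the loop first. The function name sum_runtime is made up, and the expectation stated in the comments is an assumption about typical behavior, not a guarantee.

```cpp
// Hypothetical variant of the example with the trip count known only at
// run time. Full unrolling needs a compile-time trip count, so this loop
// should still exist when the vectorizer runs (unless the function is
// inlined into a caller that passes a constant n).
//
// Requesting remarks from both passes shows who transformed the loop:
//   clang++ -O3 -c file.cpp -Rpass=unroll -Rpass=loop-vectorize
int sum_runtime(const int* a, int n, int start) {
  int count = start;
#pragma clang loop vectorize(enable)
  for (int i = 0; i != n; ++i) count += a[i];
  return count;
}
```

If only an unroll remark appears for a loop carrying the pragma, that is the “no diagnostic” case described above: the loop was gone before the vectorizer could accept or reject it.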

Thanks for taking the time to generate the example and think about a solution. It’s a lot of help!

Tyler

Back to this topic, I still have two issues.

First consider a “for_each” like function (this is a minimal example):

template <typename F>  //
[[gnu::flatten]] constexpr auto for_each(int length, F&& f) noexcept {
#pragma clang loop vectorize(enable) interleave(enable)
  for (int i = 0; i < length; ++i) f(i);
}

This function will typically be inlined at every point of usage. Depending on the values of length, min, max, the function might be vectorized or not.

However, -Rpass-analysis=loop-vectorize only tells whether the loop of the function was vectorized, and points to the loop. In a large program, one will get that the loop is sometimes vectorized and sometimes isn’t. There is no way to trace back in which context this happened. Without this information, there is not much one can do about it. Does it make sense that the function is/isn’t vectorized? It is impossible to know, and it is impossible to “repair” those places in which vectorization failed.
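
To illustrate, here is a sketch with two call sites of such a function. The callers add_one and prefix_sum are made up, and the comments about which one vectorizes describe what one would expect, not measured results.

```cpp
template <typename F>
[[gnu::flatten]] constexpr void for_each(int length, F&& f) noexcept {
#pragma clang loop vectorize(enable) interleave(enable)
  for (int i = 0; i < length; ++i) f(i);  // all remarks point at this line
}

// Call site 1: iterations are independent, a good vectorization candidate.
void add_one(int* a, int n) {
  for_each(n, [a](int i) { a[i] += 1; });
}

// Call site 2: every iteration reads the result of the previous one
// (a loop-carried dependence), so this copy of the loop is not expected
// to vectorize.
void prefix_sum(int* a, int n) {
  for_each(n, [a](int i) {
    if (i > 0) a[i] += a[i - 1];
  });
}
```

After inlining there are two copies of the loop with different outcomes, yet both are attributed to the same source line inside for_each, so the remark alone does not say which caller is affected.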

The second issue is that of alignment and restrict. I would like the -Rpass-analysis=loop-vectorize flag to tell me when marking a pointer with restrict, or providing the alignment of a pointer or variable, would enable vectorization. Even better, it could tell me when doing so would enable better code generation (like generating code for a single vectorized version of the loop instead of checking all possible combinations of alignment, aliasing,… at run-time and trashing the instruction cache).
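
For reference, a sketch of what supplying that information by hand looks like today. The function name add_arrays and the 32-byte alignment are assumptions chosen for illustration; __restrict and __builtin_assume_aligned are existing compiler extensions.

```cpp
// Hypothetical function. The caller must guarantee that the three arrays do
// not overlap and that each pointer is 32-byte aligned; the compiler trusts
// these promises and does not verify them.
void add_arrays(float* __restrict out, const float* __restrict a,
                const float* __restrict b, int n) {
  // __builtin_assume_aligned returns its argument as void*, annotated with
  // the promised alignment.
  float* o = static_cast<float*>(__builtin_assume_aligned(out, 32));
  const float* x = static_cast<const float*>(__builtin_assume_aligned(a, 32));
  const float* y = static_cast<const float*>(__builtin_assume_aligned(b, 32));
#pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i) o[i] = x[i] + y[i];
}
```

With these annotations the compiler has no reason to emit run-time overlap or alignment checks for the loop. The difficulty raised above is knowing which loops would benefit, since nothing currently points the user to them.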