Dear all,
Problem statement
Let’s start with one simple example:
    typedef unsigned long long i64;
    extern int a[1024][1024];
    extern int a2[1024];
    void foo(int k, i64 m, i64 n)
    {
      int i1, i2;
    #pragma clang loop vectorize(enable) vectorize_width(4)
      for (i1 = 0; i1 < m; i1++) {
        a2[i1] = i1;
        for (i2 = 0; i2 < n; i2++)
          a[i2][i1] = i1 + k;
      }
    }
The goal is to vectorize the outer loop (induction variable i1). Today this is only supported through the VPlan native path, which requires the option -enable-vplan-native-path and the ‘vectorize’ pragma, and comes with other restrictions.
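To make the intent concrete, here is a hand-written C model of what outer loop vectorization with a width of 4 would compute for this loop nest. It is only a sketch and not what the vectorizer actually emits: the array size `N`, the `int` loop bounds `m` and `n`, and the helper names are my own assumptions for illustration. Each "lane" is one iteration of i1, so the store to `a[i2][i1 .. i1+3]` becomes a contiguous access within row i2.

```c
#include <string.h>

enum { N = 64, VF = 4 }; /* assumed sizes; VF mirrors vectorize_width(4) */

static int a[N][N], a_ref[N][N];
static int a2[N], a2_ref[N];

/* Scalar reference: the loop nest from the example, bounds m and n <= N. */
static void foo_scalar(int k, int m, int n) {
    for (int i1 = 0; i1 < m; i1++) {
        a2_ref[i1] = i1;
        for (int i2 = 0; i2 < n; i2++)
            a_ref[i2][i1] = i1 + k;
    }
}

/* Model of the outer loop vectorized by VF: the loops over l stand in
   for the vector lanes, each lane being one i1 iteration. */
static void foo_outer_vf4(int k, int m, int n) {
    int i1 = 0;
    for (; i1 + VF <= m; i1 += VF) {
        int lane_i1[VF]; /* vector of induction values <i1, i1+1, i1+2, i1+3> */
        for (int l = 0; l < VF; l++)
            lane_i1[l] = i1 + l;
        for (int l = 0; l < VF; l++)
            a2[i1 + l] = lane_i1[l];            /* one contiguous store */
        for (int i2 = 0; i2 < n; i2++)          /* inner loop stays a scalar loop */
            for (int l = 0; l < VF; l++)
                a[i2][i1 + l] = lane_i1[l] + k; /* contiguous within row i2 */
    }
    for (; i1 < m; i1++) {                      /* scalar remainder iterations */
        a2[i1] = i1;
        for (int i2 = 0; i2 < n; i2++)
            a[i2][i1] = i1 + k;
    }
}

/* Returns 1 if the model matches the scalar loop nest for these bounds. */
int check_outer_vf4(int k, int m, int n) {
    memset(a, 0, sizeof a);
    memset(a_ref, 0, sizeof a_ref);
    memset(a2, 0, sizeof a2);
    memset(a2_ref, 0, sizeof a2_ref);
    foo_scalar(k, m, n);
    foo_outer_vf4(k, m, n);
    return memcmp(a, a_ref, sizeof a) == 0 &&
           memcmp(a2, a2_ref, sizeof a2) == 0;
}
```

Vectorizing the inner loop instead would put consecutive i2 values in the lanes, and `a[i2][i1]` would then be a strided access (one element per row), which is why the outer loop is the more attractive candidate here.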
I submitted issue #60879 two months ago, but no one responded. My guess is that people are more interested in production use of outer loop vectorization, given how many open issues the current outer loop vectorization has. That is why automatic outer loop vectorization matters: once it is usable in production, more people will be attracted to refining it.
Why do I propose this RFC?
- Push forward the work on automatic outer loop vectorization.
- I want to know whether someone could help review the code when patches are submitted, and whose approval is needed for a patch to land.
- I work on Classic Flang and LLVM Flang. Fortran is column-major, so it is common to write code that would benefit from outer loop vectorization. I want to add some support for it.
Plan
- Relax the canonical-loop requirement to stride-one loops. D147951 (Please help review it if you are interested.)
- Add more restrictions to the legality analysis for the case where -enable-vplan-native-path is not enabled and the ‘vectorize’ pragma is absent, for example that the inner loop cannot be vectorized and that the array indexing in the loop body violates row-major order for C/C++. Some simple performance investigation is here ([VPlan][OuterLoop] Performance investigation between inner loop and outer loop optimization · Issue #62065 · llvm/llvm-project · GitHub). In this step, the aim is to vectorize the outer loop automatically, which may need a lot of restrictions. Simple loops are common in Fortran, so it is worth doing for us. The path using the -enable-vplan-native-path option and the ‘vectorize’ pragma is still kept for debugging.
- Add a rough cost model that does not yet compare the cost of inner loop vectorization against outer loop vectorization. If-conversion is also not included.
- Support scalable vectors, starting with simple cases. Gather-scatter analysis can be supported gradually.
- Support interleaving.
- …
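For the first plan item, a small C sketch of the loop shapes involved may help. I am assuming here that "canonical" means the induction variable starts at zero and increments by one, and that "stride-one" keeps the step of one but allows any lower bound. The function names and the array `b` are made up for illustration and are not taken from D147951.

```c
#include <string.h>

enum { LEN = 16 };
static int b[LEN];

/* Canonical shape: induction variable starts at 0 and steps by 1. */
void fill_canonical(int k) {
    for (int i = 0; i < LEN; i++)
        b[i] = i + k;
}

/* Stride-one but not canonical: step is 1, lower bound is arbitrary.
   Requires 0 <= lo <= hi <= LEN. */
void fill_stride_one(int lo, int hi, int k) {
    for (int i = lo; i < hi; i++)
        b[i] = i + k;
}

/* The same loop rewritten over a zero-based counter j = i - lo,
   which is the canonical shape again. */
void fill_normalized(int lo, int hi, int k) {
    for (int j = 0; j < hi - lo; j++)
        b[lo + j] = (lo + j) + k;
}

/* Returns 1 if the stride-one and normalized loops write the same values. */
int same_result(int lo, int hi, int k) {
    int tmp[LEN];
    memset(b, 0, sizeof b);
    fill_stride_one(lo, hi, k);
    memcpy(tmp, b, sizeof b);
    memset(b, 0, sizeof b);
    fill_normalized(lo, hi, k);
    return memcmp(tmp, b, sizeof b) == 0;
}
```

Loops with a nonzero lower bound come up naturally in Fortran, where arrays are indexed from 1 by default, so accepting the stride-one shape directly matters for the Flang use case.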