[RFC] Do we want to move forward with automatic outer loop vectorization?

Dear all,

Problem statement

Let’s start with one simple example:

typedef unsigned long long i64;
extern int a[1024][1024];
extern int a2[1024];

void foo(int k, i64 m, i64 n)
{
  int i1, i2;

#pragma clang loop vectorize(enable) vectorize_width(4)
  for (i1 = 0; i1 < m; i1++) {
    a2[i1] = i1;
    for (i2 = 0; i2 < n; i2++)
      a[i2][i1] = i1 + k;
  }
}

The target is to vectorize the outer loop (induction variable i1). The current support goes through the VPlan native path, which requires the option -enable-vplan-native-path, the ‘vectorize’ pragma semantics, and a number of other restrictions.
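For reference, this is roughly how the native path is exercised today (a sketch; the exact driver invocation may vary, and -Rpass=loop-vectorize is only added here to print whether the loop was vectorized):

  clang -O2 -mllvm -enable-vplan-native-path -Rpass=loop-vectorize -c foo.c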

I submitted issue #60879 two months ago, but no one responded. I guess people would be more interested if outer loop vectorization were usable in production, considering how many issues it currently has. That is why automatic outer loop vectorization is important: with production usage, more people can be attracted to refine it.

Why do I propose this RFC?

  1. Push forward the work on automatic outer loop vectorization.

  2. I want to know whether someone could help review the code when a patch is submitted, and whose approval is needed before the patch can land.

  3. I work on Classic Flang and LLVM Flang. Fortran is column-major, so it is common to write code that benefits from outer loop vectorization, and I want to add some support for it.

Plan

  1. Relax the canonical-loop requirement to any stride-one loop: D147951 (please help review it if you are interested). A sketch of such a loop follows this list.
  2. Add more restrictions in the legality analysis when -enable-vplan-native-path and the ‘vectorize’ pragma are not used, for example only handling loops whose inner loop cannot be vectorized and whose array indexing goes against row-major order for C/C++. A simple performance investigation is in llvm/llvm-project issue #62065 ([VPlan][OuterLoop] Performance investigation between inner loop and outer loop optimization). In this step, the aim is to vectorize the outer loop automatically, which may need a lot of restrictions. Such simple loops are common in Fortran, so this is worth doing for us. The path with the -enable-vplan-native-path option and the ‘vectorize’ pragma is still kept for debugging.
  3. Add a rough cost model that does not yet compare the cost of inner loop vectorization against outer loop vectorization. If-conversion is also not included.
  4. Support scalable vectors, starting with simple cases. Gather/scatter analysis can be supported gradually.
  5. Support interleaving.
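As an illustration of item 1, here is a minimal sketch of my own (not taken from D147951) of an outer loop that is stride-one but not canonical, because its induction variable does not start at zero. Depending on earlier passes, such a loop may or may not already be rewritten into canonical form before the vectorizer runs:

extern int b[1024][1024];
extern int b2[1024];

void bar(int k, int lb, int ub)
{
  // The outer induction variable starts at lb rather than 0 but still steps
  // by 1, i.e. the loop is stride-one without being canonical.
#pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i1 = lb; i1 < ub; i1++) {
    b2[i1] = i1;
    for (int i2 = 0; i2 < 1024; i2++)
      b[i2][i1] = i1 + k;
  }
}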

Hi,

thanks for reaching out! I’ll take a look in a bit. Current work has mostly been focused on evolving the VPlan infrastructure based on the needs of the inner loop vectorizer, but I think it might be worth starting to think about how to make progress on the outer loop part again as well.

The current outer-loop path is unfortunately not very well tested, and I think a first step towards moving forward would be to add some end-to-end test coverage to llvm/llvm-test-suite, to make sure we have a better chance of catching issues early.

A good place to start would probably be adding some initial tests to llvm-test-suite/SingleSource/UnitTests/Vectorizer using the pragmas + -enable-vplan-native-path. @PeixinQiao would that be something you’d be interested in helping out with?

Cheers,
Florian

Hi @fhahn ,

Thanks for your interest in this.

I agree with your idea of starting with some end-to-end tests. I tested correctness locally when relaxing the canonical loop to the stride-one loop, but I don’t want to add the test results in an issue or somewhere else every time a new patch is submitted. I will check how to add some tests to llvm-test-suite/SingleSource/UnitTests/Vectorizer.


Outer-loop vectorization currently has lots of limitations and issues. I have also started to look at it again and have a fix to support scalable vectors for RISC-V. I plan to put it up for review this week if I don’t see any big problems on x86 or ARM.

From my perspective, the plan of attack should be the following:

  1. Add data dependence analysis for outer loops. This is quite an important part, since otherwise the ‘forced’ clause has a different meaning for inner and outer loop vectorization, i.e. it behaves like omp simd.
  2. Add memory access analysis. Right now gathers and scatters are generated even when the memory access is unit-strided. I’m not sure which target you used for your experiments, but on some targets gathers are pretty expensive, so the vector code won’t be efficient.

These two are essential to at least start promoting hinted outer-loop vectorization to users, and they can be done in parallel.


  3. Explicit interleaving for VLS or ARM VLA.

Since we are talking about moving it to automatic mode, there is also a need to add:

  4. A driver/planner to traverse the loop nest in some order, top-down or bottom-up. Also, clarify how hints should behave there.
  5. A cost model.
  6. A unified way to build and represent VPlan for innermost and outer loops.
  7. Other improvements.

That sounds great; please let me know if you need any further pointers on this!

Hi Kolya, could you share a couple of C/C++ examples where the current Dependence Analysis does not work for outer loop vectorization?

The simplest one is:

void bad_dependency(int n, int a[1024][1024], int b[1024][1024])
{
  #pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i = 1; i < n; i++) {
    for (int j = 0; j < n; j++) {
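        // Loop-carried dependence: a[j][i] reads a[j][i-1], written by the
        // previous iteration of the outer i-loop, so vectorizing i is unsafe.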
        a[j][i] = a[j][i-1] + b[i][j];
    }
  }
}

The indexing is intentionally reversed to highlight point 2 made earlier:

  %21 = getelementptr inbounds [1024 x i32], ptr %1, <4 x i64> %20, <4 x i64> %18
  %22 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %21, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison), !dbg !13, !tbaa !14
  %23 = getelementptr inbounds [1024 x i32], ptr %2, <4 x i64> %17, <4 x i64> %20, !dbg !13
  %24 = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %23, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison), !dbg !18, !tbaa !14
  %25 = add nsw <4 x i32> %24, %22, !dbg !19
  %26 = getelementptr inbounds [1024 x i32], ptr %1, <4 x i64> %20, <4 x i64> %17, !dbg !19
  call void @llvm.masked.scatter.v4i32.v4p0(<4 x i32> %25, <4 x ptr> %26, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>), !dbg !20, !tbaa !14

Similarly, if the pointers may overlap, the current outer-loop vectorization won’t generate any runtime checks.
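To make that concrete, here is a minimal sketch of my own (names made up, not taken from the post above) where the two pointer parameters may alias, so a correct outer-loop vectorization of the i-loop would need the kind of runtime overlap checks the inner-loop vectorizer emits:

// If dst and src overlap (e.g. dst == src + 1), the scalar loop writes
// dst[j*n+i] before reading src[j*n+i+1], while the vectorized i-loop reads
// all lanes first, so the results differ without a runtime overlap check.
void maybe_overlap(int n, int *dst, int *src)
{
  #pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      dst[j * n + i] = src[j * n + i] + 1;
}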

Hello all,

I have also played around with the outer-loop vectorizer recently, and just like @nikolaypanchenko, in order to make it work with scalable vectors. It works, but a problem I have is that currently the VPlan-native path (required for outer-loop vectorization) is limited by an assertion to vectorization factors that are a power of 2, which cannot be guaranteed for scalable vectors. I removed the assert and nothing crashes, but I guess someone placed the assert there for a reason. This comment explains that there are problems if an IV starting at zero overflows when it does not increase by a power of 2 per iteration, but the inner-loop vectorizer handles this in a way that should be applicable to outer-loop vectorization as well. Is anyone aware of other reasons?

As @nikolaypanchenko also mentioned, the VPlan-native path currently skips a lot of analysis steps, most notably LoopAccessAnalysis. It currently bails out super early if the analyzed loop is not the innermost one (see here in LoopAccessAnalysis.cpp). Removing this line, LoopAccessAnalysis does not crash when executed on some outer loops I tested it with, but I cannot say with confidence that the results are all correct at the moment.

Using ScalarEvolution, I managed to write a VPlanTransforms-style transformation that marks some contiguous memory accesses as consecutive instead of always generating gather/scatter instructions like the VPlan-native path currently does, but proper LoopAccessAnalysis with proper dependence and alias analysis for outer loops would be much, much better.
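For reference, the core of such a check can be sketched with ScalarEvolution roughly as follows. This is only my rough approximation of the idea, not the actual transformation; the helper name and the element-size parameter are assumptions:

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/IR/Value.h"

using namespace llvm;

// Hypothetical helper: treat a pointer as consecutive for the loop being
// vectorized if its SCEV is an affine AddRec over that loop whose step
// equals the accessed element size (i.e. unit stride).
static bool isConsecutiveInLoop(Value *Ptr, uint64_t ElemSize, const Loop *L,
                                ScalarEvolution &SE) {
  const SCEV *PtrSCEV = SE.getSCEV(Ptr);
  const auto *AR = dyn_cast<SCEVAddRecExpr>(PtrSCEV);
  if (!AR || AR->getLoop() != L || !AR->isAffine())
    return false;
  const auto *Step = dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE));
  // Unit stride w.r.t. L: the step is a known constant equal to the element size.
  return Step && Step->getAPInt() == ElemSize;
}

A real implementation would of course also have to prove that widening the access is legal, which is exactly where the missing outer-loop LoopAccessAnalysis hurts.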

I also toyed around with an alternative to VPlanTransforms::VPInstructionsToVPRecipes that is a bit smarter about which instructions to vectorize (for example, it does not widen the address calculation of a consecutive load) and is capable of very limited reductions, but not having LoopAccessAnalysis and not having as much information in LoopVectorizationLegality (the outer-loop vectorizer/VPlan-native code path there is also very different from the one regular inner-loop vectorization takes) makes everything quite tricky, especially with regard to correctness/legality.

I only started looking into all of this recently, and I am relatively new to LLVM. Any help is welcome, and I may also help on the VPlan-native path if work on it picks up again.

Best regards!

Can we use an explicit command-line flag to enable the loop vectorizer instead of the compiler directive #pragma?

Also, one more thing I have noticed: the assembly generated by the VPlan path with SVE uses uzp1 instructions that operate on the same source vector twice… does that make sense?