Steps towards generalizing vectorization in Affine

Thanks for starting this thread Diego and sorry for the delay, I was out last week.
I think you put your finger on some key composability limitations.

The main problem I see is that vector.transfer is a high-level op (which you also mention in your last post) but for some reason, 1-D affine.vector_load/store “raises” to it on the way to LLVM. I understand this was done for expediency but it seems to me this is the root cause of your immediate problem. I would look into 2 things:

  1. refactor the lowering of affine to std so it avoids going through vector.transfer.
  2. add a legalization pattern that takes 1-D vector.transfer ops with agreeable properties (permutation map + alignment + layout) and rewrites them to affine.vector operations. You can then apply those patterns greedily. If you use a post-pass that runs immediately after vectorization, this could grow, in the future, to legalize more things for affine analyses and transformations.
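
As a sketch, the legalization in 2. could look as follows in IR (the exact transfer syntax and attributes are illustrative; the pattern would only fire when the map is a minor identity and alignment/layout agree):

```mlir
// Before: a 1-D vector.transfer_read with a minor-identity permutation map,
// unmasked and known in-bounds.
%v = vector.transfer_read %A[%i, %j], %pad
    {permutation_map = affine_map<(d0, d1) -> (d1)>}
    : memref<?x?xf32>, vector<8xf32>

// After: the affine counterpart, amenable to affine analyses.
%v = affine.vector_load %A[%i, %j] : memref<?x?xf32>, vector<8xf32>
```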

Absolutely, the affine dialect has some conflations and limitations. Here is a small laundry list for such an ecosystem that I would welcome:

  1. AffineExpr/AffineMap (resp. affine.apply/min/max) are not affine types (resp. operations) but really std/arith types (resp. operations) that apply to arbitrary index values and behave differently for “symbols” and “dims” (for affine.apply, dims compose while symbols concatenate). What SSA values “symbols” and “dims” are allowed to bind to in a particular transformation setting is a completely orthogonal question. Longer term, this hints at a new arith dialect that could be extended with polynomial, log/exp/shift expressions (for recursion, FFT, scans, etc.).
  2. most other affine dialect operations are just std operations whose operands are the result of arith.affine_apply/min/max with extra dim/symbol operands requirements related to AffineScope. Using the std form for these ops with additional arith.affine_apply/min/max analysis/canonicalizations would trigger the following:
    a. shake up missing canonicalization infra (e.g. std.addi → arith.affine_apply, nested loop bounds canonicalization, …).
    b. mix affine and non-affine indexing within a given load/store-like op: it should be pretty easy to extend dependence analysis to support some dependences that are partially *-distance (Allen & Kennedy notation) and relax existing violated-dependence analysis checks (e.g. here).
    c. longer-term, remove duplicated “affine-only” IR, whose existence is a byproduct of MLIR initially only being able to represent affine.load/store/for/if. Arguably, this would require a bit of “raising”/analysis to make arith.affine_apply/min/max more prominent, but this is simple in the overall picture.
  3. extend and generalize transformations that have been implemented pessimistically in “affine only” form to also work more optimistically with only subsets of arith.affine_apply operands.
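
To illustrate the “dims compose” behavior from point 1 (a minimal sketch using today’s affine.apply syntax; the composed form is what canonicalization produces):

```mlir
// Two chained applies on a dim operand ...
%a = affine.apply affine_map<(d0) -> (d0 * 2)>(%i)
%b = affine.apply affine_map<(d0) -> (d0 + 1)>(%a)

// ... compose into a single map:
%b = affine.apply affine_map<(d0) -> (d0 * 2 + 1)>(%i)
```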

There should be no back-and-forth: vector.transfer is a higher-level op than affine.vector_load.

I’d say it’s even more isolated than vector->affine because of the narrow use case that @dcaballe has: in the short term, we only want to legalize 1-D vector.transfer ops.

Note that in the case of interest to Diego, legalizing 1-D vector.transfer ops would suffice to unblock him.
But generally, +1 to such a legalization pass. In the grander picture of the arith.affine_apply/min/max ops described above, do you foresee significantly more complex problems than proper canonicalization of these 3 ops + std.addi/muli/constant/... plus some simple traversals to detect preconditions for transformations?

Not yet AFAIK, but I don’t foresee major issues. I’d say it’s mostly missing canonicalizations into arith.affine_apply/min/max. For the short-term interop you seek, I don’t think you’d even need those?

I have some trouble understanding what this would look like, could you maybe give some examples?
It seems to me this goes against the other parts because:

  1. for your immediate problem you only need simple patterns for 1-D vector.transfer
  2. if targeting better interplay between affine analysis and other dialects, vector->affine would seem to increase the footprint of “affine-only” abstractions.

vector.transfer_read/write is a generic, higher-level abstraction that works for n-D vectors and composes well with many other vector-level abstractions and with progressive lowering. In the particular corner case of interest here (1-D + power-of-2 vectors + a vector size that divides the problem size), they can easily lower to something that affine can understand today. The fact that affine does not understand anything other than “affine-only” constructs is a deeper problem, unrelated to future-proof vectorization abstractions.

The real problem is that affine.vector_load/store lowers to vector.transfer, which is a higher-level abstraction, on the way to LLVM. The correct path should be vector.transfer → canonicalize 1-D affine.vector → LLVM (probably via std).

Given the example listed in https://reviews.llvm.org/D85885 :

%AV = memref_vector_cast %A : memref<?x?xf32> to memref<?x?xvector<8xf32>>

It is still unclear to me what happens when any of the following holds:

  1. the 1-D vector length is not a power of 2 (or whatever alignment the data layout requires), or
  2. ? % 8 != 0 for the second memref<?x?xf32> dimension: parts of the data seem to just not be addressable?
  3. ? % 8 < 0 for the second memref<?x?xf32> dimension: this seems like undefined behavior.
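
To make point 2 concrete, consider a hypothetical static inner dimension of 19 (the cast op is the one proposed in D85885; the shape is illustrative):

```mlir
// Only 19 floordiv 8 = 2 full vectors fit in the inner dimension:
%AV = memref_vector_cast %A : memref<?x19xf32> to memref<?x2xvector<8xf32>>
// %A[%i, 16..18] have no counterpart in %AV: that part of the data is
// not addressable through the vectorized view.
```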

It is trivial to canonicalize from the higher-level vector.transfer to the lower-level subset that affine can deal with. Targeting a high-level abstraction and lowering (very) progressively has been quite successful in MLIR, so I am not sure why we’d want to target a lower-level abstraction instead.

Indeed, the vector dialect is not limited by affine-only constraints and neither should the transformation. It turns out the implementation was limited by the IR we had at the time and it is probably time to refresh some of this. In any case, I’ll reiterate that legalizing vector.transfer for the corner case that affine analyses support is a trivial rewrite.

Yeah, I’m not super satisfied with this name either; naming is hard. The initial feedback I was sympathetic to was that people expected vectorization to be 1-D only and found the name too confusing, so I went for “supervectorizer”, drawing inspiration from “superscalar”. Happy to revert if it is now imprinted in people’s minds that MLIR vectors and vectorization are a multi-1-D-vector abstraction.
In any case, let’s absolutely not use “tile”: it is one of the most overloaded terms in HPC compilers, hardware, and algorithms… (otherwise, should the transformation be called “tiling” :man_shrugging: ?)

This is a step further in the wrong direction: vector.transfer is still higher-level than affine.vector_load/store once N>1-D vectors are involved, and it will continue to evolve this way to support gather/scatter, sparse, and canonicalization/folding with other vector ops (unrolling, insert/extract, contract, transpose, a future generic op, etc.). Also, “tile”, really? :slight_smile:

Agreed.

Agreed, to break the back-and-forth cycle you mentioned, but the other way around :slight_smile: vector.transfer is still a higher-level abstraction than affine.vector_load/store.

Yes and they are going to continue evolving towards higher-level abstractions (e.g. sparse).

Absolutely, there has already been quite some work on progressive “Vector → Vector” lowering + canonicalization + folding.

In the fullness of time, we should indeed expose more of those + cost models. Mundane constraints such as timing and number of hands on deck get in the way… As a side note, in the particular case of memory + broadcast, it turns out that LLVM does a great job at converting std.load/std.store + std.insert/std.extract + the resulting lowering of (vector.contract → vector.outerproduct → vector.fma → llvm.fmuladd) into “broadcast from memory” form (see the first assembly listing on slide 42 here):

13: 62 f2 5d 58 b8 46 01 vfmadd231ps zmm0,zmm4,DWORD PTR [rsi+0x4]{1to16}
1a: 62 f1 54 58 59 4e 20 vmulps zmm1,zmm5,DWORD PTR [rsi+0x80]{1to16}
21: 62 f2 5d 58 b8 4e 21 vfmadd231ps zmm1,zmm4,DWORD PTR [rsi+0x84]{1to16}

It seems to me that adding ops with “affine-only” semantics to the vector dialect would only displace interoperability concerns into the vector dialect. So far I have been pretty happy with vector ops that compose, and I am not looking to break that. If simplicity is what you are after and vector.transfer is too high-level, there is also std.load/std.store, but of course this does not resolve the “affine dialect composition” problem, so you’d still need a legalization pattern. However, the ? % 8 != 0 case still needs to be handled somehow.

Yes, we indeed have this behavior, and we are continuing to add more progressive canonicalization and lowering to the vector dialect.

Now, irrespective of the points above, there is still the fact that this pass was written quite some time ago and needs to be refreshed, especially in light of linalg → vector needs and the upcoming sparse needs. The affine parts (affine.for dependence analysis + affine.for rewrite + affine.load/store) can be decoupled from the op-vectorization part. The op-vectorization part can be reused in a) affine, b) Linalg, and c) a potential scf.parallel_for + std.load/std.store path if someone is interested in that. In fact, some of this has been written out of core for Linalg and should be unified.

One progressive way forward could be to introduce a vector.generic op with structured-ops semantics, as has been discussed in the past (slide 48 shows vector.contract), and use it as a common ground. Would that sound useful to people?
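
For context, vector.contract already carries structured-ops semantics via indexing maps + iterator types; a vector.generic would generalize that form. An illustrative matmul-shaped contraction:

```mlir
%res = vector.contract {
    indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
                     affine_map<(i, j, k) -> (k, j)>,
                     affine_map<(i, j, k) -> (i, j)>],
    iterator_types = ["parallel", "parallel", "reduction"]}
  %lhs, %rhs, %acc : vector<4x8xf32>, vector<8x4xf32> into vector<4x4xf32>
```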

For Diego’s immediate blocker, the 2 steps (restated here) should do:

  1. refactor the lowering of affine to std so it avoids going through vector.transfer.
  2. add a legalization pattern that takes 1-D vector.transfer ops with agreeable properties (permutation map + alignment + layout) and rewrites them to affine.vector operations.

Thoughts?