[MLIR] Multidimensional vector abstraction

We worked on performance portable Stencil computation across CPUs and GPUs by leveraging a specific flavor of multidimensional vectors[1]. We hope our code generation strategy could be integrated into MLIR to enable performance portable code through MLIR instead of our custom code generator.

The motivation of our vector abstraction stems from that hardware vectors are sometimes too long for efficient cache blocking for stencil-like applications when one-point in space requires values from multiple neighboring points. For higher-order stencils that help reduce numerical errors, these “reach” can be very large, causing TLB and cache issues. YASK[2] thus introduced this notion of folded vectors [P.S.] that helps with the performance of such stencil-computations.

In our abstraction, a vector is naturally multidimensional where vector<4x4xf32> maps to a single vector on AVX512. And thus they are stored contiguously in memory. The layout transformation is supported by our “brick” library in C++. Elementwise operations and broadcast are supported on such platforms. Align operation implemented using “_mm512_alignr_epi32” allows vectors to be blended to create the neighboring access effect intrinsic to stencils. The align operates on two same virtually-shaped vectors by concatenating them along the dimension - dim, and shift down by the specified amount then take the results by slicing out the shape of the operands.

%2 = vector.align %0, %1, 1 (dim), 2 (amount): vector<4x4xf32> -> vector<4x4xf32>

On GPUs, each warp can be viewed as a vector and the align operation can be realized using warp shuffle (__shfl_down_sync).

This abstraction with elementwise operations and 2 type of intrinsics allowed us to support a wide range of stencil applications and achieved good performance portability[3].

We are investigating if it is possible to implement the same type of vector abstraction into MLIR.

If I’ve read correctly, the current vector abstraction in MLIR is more of a collection of 1D vectors when generating code. I’m interested in implementing a “flattening”-pass from operations on vectors of <4x4x4xf32> to <4x16xf32> this requires some change in the casting during load/store ops such as

a) MemRef<?xvector<4x4x4xf32>> to MemRef<?xvector<4x16xf32>>


MemRef<?x64xf32>> to MemRef<?xvector<4x16xf32>>
MemRef<?x4x4x4xf32>> to MemRef<?xvector<4x16xf32>>


c) Vector<4x4x4xf32> to Vector<4x16xf32>

c) will probably generate weird code as 4x4x4xf32 means 4x4xvector<4xf32> when lowering to llvm.
Thus I’m interested if a or b is possible. Or other ways that I can read such vector data. I haven’t figured out how to rewrite all the element-wise operations yet. So this might not be such a good idea after all.

Any suggestions as to how we should proceed?


[1] Zhao, Tuowen, et al. “Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis . 2019.
[2] Yount, Charles, et al. “YASK—Yet another stencil kernel: A framework for HPC stencil code-generation and tuning.” 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) . IEEE, 2016.
[3] Zhao, Tuowen, et al. “Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks.” 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) . IEEE, 2018.

P.S. there are vectors that are naturally multidimensional. AVX2 have vectors in the shape of vector<2x4xf32>. vperm2f128 allows align along 2 and vpalignr allows align along 4. Or if consider tensor-core operand as a vector, though they don’t support align using tensor core.

Hello @ztuowen

I do not have the cycles to go into the docs you pointed at this time but the use cases you describe seem to be covered by the MLIR Vector Dialect.

For the cast you point to, this op should give you everything you need as soon as it lands: https://reviews.llvm.org/D73635

Note however that it is not yet connected to lowering to LLVM, you are welcome to either:

  • pick it up and get familiar with the MLIR lowering/conversion infrastructure, or
  • ask for it to be implemented, (ideally by providing a test case and a description of what you expect as an output).

Either way, please let us know whether the analysis about your use cases is correct.
Please also file a bug for the VectorOps Dialect here.

Thank you!

Thanks, we’ll look into it.

@aartbik and @andydavis1 for visibility.

Actually, no. The vector.shape_cast doesn’t appear to deal with a surrounding memref and changing its shape as well but the OP is asking for that in (a) and (b).

@ztuowen What you need for the memref part is pretty much the memref_shape_cast here. There is also the LLVM lowering for it. A similar abstraction doesn’t exist upstream but you are welcome to extend this and contribute.

@bondhugula how is Andy’s vector.shape_cast addition not a perfect fit for case c) ?

c) vector<4x4x4xf32> to vector<4x16xf32>

There are always multiple ways of solving a particular problem and this is my current recommendation given all the current moving pieces.

Of course, if this is happens to not be enough for what @ztuowen needs, I am happy to reconsider my recommendation based on new information and see how this will fit in the overall strategy.

The OP is asking for (a), (b), not c) AFAIU. Yes, c) is of course vector.shape_cast. See below.

I’ll try that as well. Shape cast also changes the Memref data structure right? Will it be better if all statically known dimensions are directly codegened with constants, so that reshapes like these wouldn’t need to touch the MemRef struct?