We have been working on performance-portable stencil computation across CPUs and GPUs by leveraging a specific flavor of multidimensional vectors [1]. We hope our code-generation strategy can be integrated into MLIR, so that performance-portable code can be produced through MLIR instead of our custom code generator.
The motivation for our vector abstraction is that hardware vectors are sometimes too long for efficient cache blocking in stencil-like applications, where one point in space requires values from multiple neighboring points. For higher-order stencils, which help reduce numerical error, this “reach” can be very large, causing TLB and cache issues. YASK [2] therefore introduced the notion of folded vectors (see P.S.), which helps the performance of such stencil computations.
In our abstraction, a vector is naturally multidimensional: vector<4x4xf32> maps to a single vector register on AVX-512, and its elements are stored contiguously in memory. The layout transformation is handled by our “brick” library in C++. Elementwise operations and broadcast on these vectors are supported on such platforms. An align operation, implemented with `_mm512_alignr_epi32`, blends two vectors to create the neighboring-access effect intrinsic to stencils. Align takes two vectors of the same virtual shape, concatenates them along a dimension `dim`, shifts down by the specified `amount`, and then slices out a result with the shape of the operands:
%2 = vector.align %0, %1, 1 (dim), 2 (amount): vector<4x4xf32> -> vector<4x4xf32>
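For concreteness, here is a minimal scalar sketch of that semantics as we read it. The names `Vec4x4` and `align4x4` are hypothetical and only for illustration, not the brick library's API; the actual AVX-512 lowering uses `_mm512_alignr_epi32` as noted above.

```cuda
// Hypothetical scalar reference for align on two vector<4x4xf32> operands.
// Not the brick library's implementation; it only spells out the semantics:
// concatenate `a` and `b` along `dim`, shift down by `amount`, slice out 4x4.
struct Vec4x4 {
  float v[4][4];
};

__host__ __device__ inline Vec4x4 align4x4(const Vec4x4 &a, const Vec4x4 &b,
                                           int dim, int amount) {
  Vec4x4 r;
  for (int i = 0; i < 4; ++i) {
    for (int j = 0; j < 4; ++j) {
      if (dim == 0) {
        int ii = i + amount; // shift along the outer dimension
        r.v[i][j] = ii < 4 ? a.v[ii][j] : b.v[ii - 4][j];
      } else {
        int jj = j + amount; // shift along the inner dimension
        r.v[i][j] = jj < 4 ? a.v[i][jj] : b.v[i][jj - 4];
      }
    }
  }
  return r;
}
```

With this reading, the IR example above (dim = 1, amount = 2) produces a result whose element (i, j) is a[i][j+2] for j < 2 and b[i][j-2] for j >= 2.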
On GPUs, each warp can be viewed as a vector, and the align operation can be realized with warp shuffles (`__shfl_down_sync`).
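As a hedged illustration of this warp-as-vector view (again a sketch of ours, not the brick library's actual GPU code), the CUDA snippet below aligns a warp-sized vector held one element per thread with a neighboring vector. The extra `__shfl_up_sync` used to pull in the neighbor's lanes is part of this particular formulation.

```cuda
// Hypothetical warp-level align: each warp holds a 32-lane vector, one
// element per thread. `cur` is this thread's element of the current vector,
// `nxt` the corresponding element of the next vector along the same
// dimension. Returns element (lane + amount) of the concatenation [cur|nxt].
__device__ float warp_align(float cur, float nxt, int amount) {
  const unsigned full_mask = 0xffffffffu;
  int lane = threadIdx.x & 31;
  // Lanes with lane + amount < 32 take their value from `cur`, shifted down.
  float from_cur = __shfl_down_sync(full_mask, cur, amount);
  // The top `amount` lanes instead take lanes 0..amount-1 of `nxt`.
  float from_nxt = __shfl_up_sync(full_mask, nxt, 32 - amount);
  return (lane + amount < 32) ? from_cur : from_nxt;
}

// Minimal usage: shift each warp's 32-element chunk down by 2 elements,
// borrowing the first 2 elements of the next chunk. Assumes the launched
// threads cover `out` exactly, blockDim.x is a multiple of 32, and `in`
// has 32 extra padding elements at the end.
__global__ void align_demo(const float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float cur = in[i];
  float nxt = in[i + 32];
  out[i] = warp_align(cur, nxt, /*amount=*/2);
}
```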
This abstraction, with elementwise operations and these two kinds of intrinsics, allowed us to support a wide range of stencil applications and achieve good performance portability [3].
We are investigating whether it is possible to implement the same kind of vector abstraction in MLIR.
If I’ve read things correctly, the current vector abstraction in MLIR is treated more as a collection of 1-D vectors when generating code. I’m interested in implementing a “flattening” pass from operations on vector<4x4x4xf32> to vector<4x16xf32>. This requires some change in the casting during load/store ops, such as:
a) MemRef<?xvector<4x4x4xf32>> to MemRef<?xvector<4x16xf32>>

or

b) MemRef<?x64xf32> to MemRef<?xvector<4x16xf32>>
   MemRef<?x4x4x4xf32> to MemRef<?xvector<4x16xf32>>

or

c) vector<4x4x4xf32> to vector<4x16xf32>
Option c) will probably generate awkward code, since vector<4x4x4xf32> is lowered to a 4x4 array of vector<4xf32> when converting to LLVM.
Thus I’m interested in whether a) or b) is possible, or in other ways to read such vector data. I haven’t figured out how to rewrite all the elementwise operations yet, so this might not be such a good idea after all.
Any suggestions as to how we should proceed?
Tuowen
[1] Zhao, Tuowen, et al. “Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.
[2] Yount, Charles, et al. “YASK—Yet another stencil kernel: A framework for HPC stencil code-generation and tuning.” 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). IEEE, 2016.
[3] Zhao, Tuowen, et al. “Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks.” 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE, 2018.
P.S. There are hardware vectors that are naturally multidimensional. AVX2 vectors can be viewed as vector<2x4xf32>: vperm2f128 allows align along the dimension of size 2, and vpalignr allows align along the dimension of size 4. Tensor-core operands could also be considered vectors, though tensor cores do not support the align operation.