Vector Dialect Ops for Intel Intrinsics - Optimizing linalg.copy

I am working on optimizing the performance of linalg.copy (for n-D transposition), based on the HPTT library [1]. HPTT decomposes a multidimensional tensor into two-dimensional b×b macrotiles and w×w microtiles (where w is the vector width and b = 4w). Each microtile is transposed by a microkernel that performs in-register transposition with explicit vectorization. I am attempting to rewrite this microkernel in MLIR (using the vector ops, after lowering from the linalg to the vector dialect), specialized for single-precision elements and AVX-512. The microkernel mainly involves the following Intel vector intrinsics:

  1. _mm256_loadu_ps (float const * mem_addr) [2]
  2. _mm256_unpacklo_ps (__m256 a, __m256 b) [3]
  3. _mm256_unpackhi_ps (__m256 a, __m256 b) [4]
  4. _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8) [5]
  5. _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8) [6]
  6. _mm256_storeu_ps (float * mem_addr, __m256 a) [7]
    (Please refer to the microkernel implementation here: [8], lines 139-225)
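To make the data movement of intrinsics 2-5 concrete, here is a small Python sketch that models `__m256` values as 8-element lists and reproduces the lane semantics described in the Intel Intrinsics Guide. The helper names are my own; only the permutation patterns are taken from the guide.

```python
# Model 8-lane (__m256) single-precision vectors as Python lists so the
# permutation performed by each intrinsic is explicit. Helper names are
# hypothetical; lane semantics follow the Intel Intrinsics Guide.

def unpacklo_ps(a, b):
    # Interleave the low halves of each 128-bit lane of a and b.
    return [a[0], b[0], a[1], b[1], a[4], b[4], a[5], b[5]]

def unpackhi_ps(a, b):
    # Interleave the high halves of each 128-bit lane of a and b.
    return [a[2], b[2], a[3], b[3], a[6], b[6], a[7], b[7]]

def shuffle_ps(a, b, imm8):
    # Per 128-bit lane: two elements selected from a, then two from b,
    # with the four 2-bit selectors packed into imm8.
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    out = []
    for base in (0, 4):
        out += [a[base + sel[0]], a[base + sel[1]],
                b[base + sel[2]], b[base + sel[3]]]
    return out

def permute2f128_ps(a, b, imm8):
    # Select whole 128-bit lanes from the concatenation of a and b;
    # bits 3 and 7 of imm8 zero the corresponding output lane.
    lanes = [a[0:4], a[4:8], b[0:4], b[4:8]]
    lo = [0.0] * 4 if imm8 & 0x08 else lanes[imm8 & 3]
    hi = [0.0] * 4 if imm8 & 0x80 else lanes[(imm8 >> 4) & 3]
    return lo + hi

a = [float(i) for i in range(8)]        # lanes 0..7
b = [float(i) for i in range(8, 16)]    # lanes 8..15
print(unpacklo_ps(a, b))  # [0.0, 8.0, 1.0, 9.0, 4.0, 12.0, 5.0, 13.0]
```

Note that each of these is a compile-time permutation of the 16 input lanes, which is exactly the shape of operation that a static-mask shuffle can express.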

I have been able to use MLIR’s vector.extract_slices and vector.insert_slices for intrinsics 1 and 6, respectively. I am wondering whether existing MLIR ops could be used to implement the remaining intrinsics 2-5. I am not sure if vector.shuffle (or the other vector ops) would be sufficient, or whether I would have to write new vector dialect ops for this.

Any suggestions/feedback would be helpful.

I am also interested in knowing if there are any similar efforts for optimizing multidimensional transposition in MLIR.

Thanks!
-Mahesh

Cc: @MaheshRavishankar

Ref:
[1] Springer, Paul, Tong Su, and Paolo Bientinesi. “HPTT: A high-performance tensor transposition C++ library.” In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pp. 56-62. 2017.

[2] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=mm256_loadu_ps&expand=4980,6114,5200,3410,3410,3410,3410

[3] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_unpacklo_ps&expand=3410,4980,3410,6114,5200,6114,6057,6114

[4] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_unpackhi_ps&expand=3410,4980,3410,6114,5200,6114,6057,6114,6057

[5] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_shuffle_ps&expand=3410,4980,3410,6114,6114,6057,5200,5200

[6] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_permute2f128_ps&expand=3410,4980,3410,6114,5200,6114,4172

[7] software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_storeu_ps&expand=3410,4980,3410,6114,5200,6114,5650

[8] The HPTT microkernel implementation: github.com/springer13/hptt/blob/master/src/transpose.cpp#L134

As a general strategy, I would suggest you add ops that correspond exactly to these intrinsics to the avx512 dialect (and its llvm_avx512 counterpart, although these two may get merged soon). Then we can generalize the desirable behavior of these ops to the vector dialect if necessary. The vector dialect tends to be higher level than most machine instructions and supports, e.g., multi-dimensional vectors.


Hi @maheshl,

+1 to what @ftynse suggested. Are you using the LLVM x86 backend or something else? If you wanted to keep the approach a bit more portable, maybe you could check whether the LLVM backend is able to generate what you want from regular vector shuffles. I would expect that to work for #2, #3 and #4; I am not so sure about #5. You could write a simple LLVM or MLIR test with vector shuffles describing your permutations, compile it for AVX-512, and check the generated assembly.
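For such a test, vector.shuffle (like LLVM shufflevector) numbers the lanes of the two 8-wide operands 0..7 and 8..15 and takes a static index mask. This sketch lists the masks that would reproduce intrinsics #2-#4 in that numbering; the values are derived by hand from the Intrinsics Guide descriptions, not taken from any existing MLIR lowering, so they should be double-checked against the generated assembly.

```python
# Shuffle masks in vector.shuffle / LLVM shufflevector numbering:
# operand a contributes indices 0..7, operand b indices 8..15.
# Derived by hand from the Intel Intrinsics Guide (illustrative only).

UNPACKLO_MASK = [0, 8, 1, 9, 4, 12, 5, 13]   # _mm256_unpacklo_ps
UNPACKHI_MASK = [2, 10, 3, 11, 6, 14, 7, 15] # _mm256_unpackhi_ps

def shuffle_ps_mask(imm8):
    # _mm256_shuffle_ps is a family of masks parameterized by imm8:
    # per 128-bit lane, two selectors into a and two into b.
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    mask = []
    for base in (0, 4):
        mask += [base + sel[0], base + sel[1],
                 8 + base + sel[2], 8 + base + sel[3]]
    return mask
```

Since each of these is a static permutation, a single vector.shuffle per intrinsic suffices at the IR level; whether the x86 backend then selects vunpcklps/vunpckhps/vshufps is what the suggested assembly check would verify.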


Thanks @ftynse and @dcaballe.
I have been using the MLIR vector ops so far. I will write tests with LLVM shufflevector to see whether I can generate code similar to the Intel intrinsics. Based on that, I can add any necessary ops to the avx512/llvm_avx512 dialects.

Please also have a look at some earlier docs on the vector dialect, where we talk about keeping it as the proper bridge between the “virtual vector level” and the “hardware vector level”, using progressive lowering. As stated above, we want to avoid the vector dialect itself becoming too close to the hardware too soon.