Introduction
Convolution is a central operation in machine-learning applications. We developed a new code-generation approach to computing convolutions on CPUs that fits into the MLIR ecosystem and improves performance compared to the common Im2Col + GEMM approach. This approach was presented in the Open MLIR Meeting of 2023-03-09, and this post has the objective of getting feedback on both the general viability of upstreaming it and some specific questions.
Compilation Flow
Our algorithm is composed of three compilation passes with different purposes:
- Convolution Slicing Analysis (CSA): an analysis pass responsible for determining a suitable tiling strategy for each convolution and the target hardware.
- Convolution Slicing Optimization (CSO): a lowering pass that generates an optimized implementation of a convolution using CSA's tiling strategy.
- Filter Packing (optional): if the convolution filters are static and available at compile time, this data-transformation pass can rearrange the filter data ahead of time to save execution time.
The following image shows the compilation flow for a machine-learning application. The implementation generated by CSO includes a cache-tiling loop nest (macrokernel), an input-tensor packing routine, and an optimized architecture-specific microkernel.
Convolution Slicing Analysis (CSA)
The analysis pass determines the tiling strategy based on hardware and convolution information. This tiling strategy is composed of a tile size, a tile distribution over the cache hierarchy, and a scheduling decision.
In CSA, both the input tensor and the filter set are tiled along the channel dimension; the microkernel information determines the tiling of the other dimensions. CSA thus calculates the number of channels in each tile based on cache information, so that one tile of each type (filter, input, and output) fits in the L1 cache simultaneously.
The tile distribution over the cache hierarchy is the number of tiles that are stored at once in each cache level, to maximize reuse between tiles that have already been loaded to cache. This is enforced by the loading order in the macrokernel. Both tile size and the distribution values are calculated using a heuristic.
The scheduling decision between Input Stationary and Weight Stationary refers to which tile type (input or filter) is stored in each cache level (L2 and L3). This is determined by a cost model that takes into account all data movement during the convolution process.
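As a rough illustration of the tile-size heuristic described above, here is a minimal Python sketch. The function name, parameters, and the example cache/microkernel sizes are all illustrative assumptions, not the actual pass's API: it simply finds the largest channel count such that one input tile, one filter tile, and one output tile fit in L1 together.

```python
# Hedged sketch of CSA's channel tile-size calculation (illustrative
# names; not the real pass). Pick the largest channel depth c so that
# one input tile, one filter tile, and one output tile fit in L1.

def csa_channels_per_tile(l1_bytes, n_win, n_filt, window_elems, elem_bytes=4):
    """Largest c such that
    input tile  (n_win  * window_elems * c * elem_bytes) +
    filter tile (n_filt * window_elems * c * elem_bytes) +
    output tile (n_win  * n_filt * elem_bytes)  <=  l1_bytes."""
    output_bytes = n_win * n_filt * elem_bytes          # independent of c
    per_channel = (n_win + n_filt) * window_elems * elem_bytes
    c = (l1_bytes - output_bytes) // per_channel
    return max(c, 1)

# Example: 32 KiB L1, a microkernel computing 8 windows x 8 filters,
# a 3x3 spatial window (9 elements per channel), fp32 data.
print(csa_channels_per_tile(32 * 1024, 8, 8, 9))  # -> 56
```

The real heuristic also has to produce the tile distribution over L2/L3 and feed the Input-/Weight-Stationary cost model; this sketch only covers the L1 fitting constraint.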
A fully detailed explanation can be found in the Further Information section at the end of this post.
Convolution Slicing Optimization (CSO)
The lowering pass is responsible for generating the optimized implementation using the CSA-calculated tiling values. The resulting MLIR code is divided into three parts: Loop Nest, Microkernel, and Packing. As in optimized tiled GEMM implementations, the microkernel and packing routines are architecture-specific and deeply connected.
Macrokernel Loop Nest
The cache-tiling macrokernel iterates over all tiles in an order that respects the CSA-determined tile distribution over the cache hierarchy.
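The loop structure can be sketched in plain Python (illustrative only; the actual output is MLIR). Here a Weight-Stationary order is assumed: each filter tile stays resident while input tiles stream past it, and partial outputs accumulate across channel tiles.

```python
# Minimal sketch of the cache-tiling macrokernel loop nest, assuming a
# Weight-Stationary schedule. input_tiles[c][i] and filter_tiles[c][f]
# hold the channel-tile-c slices; outputs accumulate over channel tiles.
def macro_kernel(input_tiles, filter_tiles, microkernel):
    n_c = len(input_tiles)              # number of channel tiles
    n_i = len(input_tiles[0])           # input tiles per channel tile
    n_f = len(filter_tiles[0])          # filter tiles per channel tile
    out = [[0.0] * n_f for _ in range(n_i)]
    for c in range(n_c):                # channel-tile loop
        for f in range(n_f):            # filter tile stays resident ("stationary")
            for i in range(n_i):        # input tiles stream through the cache
                out[i][f] += microkernel(input_tiles[c][i], filter_tiles[c][f])
    return out

# Toy microkernel stand-in: dot product of two flat tiles.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(macro_kernel([[[1, 2], [3, 4]]], [[[1, 1]]], dot))  # [[3.0], [7.0]]
```

An Input-Stationary schedule would simply swap the `f` and `i` loops; which order wins is exactly what the CSA cost model decides.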
Microkernel
The microkernel is responsible for the arithmetic operations that form the convolution, computing a single output tile from a filter-set tile and an input-tensor tile. The best way of performing these operations varies with what is available on the target architecture (e.g., larger vector registers or a hardware-implemented outer product), so it should be lowered separately.
Our prototype uses an outer-product-based microkernel, but the algorithm is free to use whatever is best for each architecture. The outer product is a useful operation because it is a high-density rank-1 update, producing n² outputs from 2n inputs, and is being incorporated into new architectures as ISA extensions such as IBM POWER10 MMA and Apple AMX. When unavailable in hardware, outer products can be emulated on any architecture with SIMD support.
A matrix-multiplication microkernel is often the best way to compute a series of multiplications and accumulations on the target architecture, and the convolution packing repurposes it with no modifications. Therefore, the same microkernel can be used when lowering matmul operations and is a good candidate to be separated into its own operation. This is another advantage of our approach that can help lowering to different architectures.
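To make the rank-1-update structure concrete, here is a hedged Python sketch of an outer-product-based microkernel (the real one is architecture-specific, e.g. POWER10 MMA; all names here are illustrative). Each packed step supplies n_win input values and n_filt filter values and performs one outer-product accumulation into an n_win × n_filt block of "register" accumulators.

```python
# Sketch of an outer-product-based microkernel (illustrative, scalar
# Python; real versions use vector/matrix ISA extensions).
def outer_product_microkernel(packed_input, packed_filter, n_win, n_filt):
    # packed_input:  k groups of n_win values (same index across windows)
    # packed_filter: k groups of n_filt values (same index across filters)
    k = len(packed_input) // n_win
    acc = [[0.0] * n_filt for _ in range(n_win)]
    for step in range(k):
        a = packed_input[step * n_win:(step + 1) * n_win]
        b = packed_filter[step * n_filt:(step + 1) * n_filt]
        for i in range(n_win):          # rank-1 update: acc += outer(a, b)
            for j in range(n_filt):
                acc[i][j] += a[i] * b[j]
    return acc

# 2 windows x 2 filters, a single packed step.
print(outer_product_microkernel([1, 2], [3, 4], 2, 2))  # [[3.0, 4.0], [6.0, 8.0]]
```

Note that nothing in the loop body is convolution-specific: fed GEMM-packed panels instead, the same kernel computes a matmul block, which is the reuse argument made above.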
In-depth descriptions of the microkernel's operation and access pattern can be found in the Further Information section at the end of this post.
Packing
The packing routine is the connection between the tiling process and the microkernel. It extracts a tile from the original input tensor or filter set and rearranges the data so that the microkernel can read it sequentially from memory. In the process, the data is loaded into cache. This connection to the microkernel means the packing layout is also architecture-specific.
However, outer-product-based microkernels have similar access patterns, which often leads to the following packing layout:
Each input-tensor tile contains multiple (N_{win}) windows (projections of a filter onto the input), and each filter-set tile contains multiple filters. In both cases, the elements at the same index in every window/filter are packed sequentially.
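The interleaved layout just described can be sketched in a few lines of Python (illustrative only; the generated code operates on memrefs, not lists):

```python
# Sketch of the packing layout: elements with the same index in every
# window (or filter) of a tile are laid out contiguously, so the
# microkernel reads one sequential group per outer-product step.
def pack_tiles(tiles):
    # tiles: list of equally sized, flattened windows or filters
    n_elems = len(tiles[0])
    packed = []
    for idx in range(n_elems):          # same index across all tiles...
        for t in tiles:                 # ...packed sequentially
            packed.append(t[idx])
    return packed

# Two 3-element windows -> interleaved layout.
print(pack_tiles([[1, 2, 3], [4, 5, 6]]))  # [1, 4, 2, 5, 3, 6]
```

Each contiguous group of N_{win} (or N_{filt}) values is exactly one operand vector of the outer-product microkernel.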
In a machine-learning context, the filters may be static and part of the compilation process. Therefore, the filter-packing step can be done as a data-transformation pass, as previously stated.
Vector-Based Input Packing
As evidenced by the previous image, packing the input tensor for an outer-product-based microkernel requires expanding the input tensor into windows, so that the microkernel can compute a block of output elements at once.
In convolutions with stride 1, windows overlap, so data is replicated during packing. To avoid reading the same data multiple times, one insight can be used: excluding edge cases, the next element of the current window is the current element of the next window. Given the established packing layout, a set of N_{win} elements can be shifted left with a serial-in element to obtain the following set. If this is done at the register level, the data is read only once.
Each instruction set might achieve this result with a different combination of instructions, but the Vector dialect allows for a generic implementation using the shuffle operation to extract elements from two vectors.

```mlir
%5 = vector.shuffle %3, %4 [1, 2, 3, 4] : vector<4xf32>, vector<4xf32>
```
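A plain-Python stand-in for this register-level shift (illustrative only): the next packed group is the current group shifted left by one with a single new element serialized in, which is what the shuffle with indices [1, 2, 3, 4] achieves on two 4-wide vectors.

```python
# Sketch of stride-1 vector-based packing: each input value is read once,
# and subsequent groups are derived by a shift with serial-in (emulating
# vector.shuffle %cur, %new [1, 2, 3, 4] on 4-element vectors).
def next_group(current_group, serial_in):
    return current_group[1:] + [serial_in]

row = [10, 20, 30, 40, 50]             # one input row, N_win = 4
group = row[0:4]                       # windows' element 0: [10, 20, 30, 40]
group = next_group(group, row[4])      # windows' element 1: [20, 30, 40, 50]
print(group)  # [20, 30, 40, 50]
```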
Current Prototype
Our approach is implemented in the ONNX-MLIR machine-learning framework (not upstreamed). The Conv operation of the ONNX dialect is lowered by CSO to the Affine, MemRef, Vector, and Arith dialects. The microkernel is currently implemented as an external library.
Experimental Results
The developed prototype was tested with full machine-learning model inference in the ONNX-MLIR framework against the traditional Im2Col + GEMM method using the OpenBLAS GEMM routine. Every result is the average of 100 measurements, with low variability, on an IBM POWER10 E1050 (P10) and an Intel Xeon Silver 4208 CPU (x86).
Convolution Speedup is the performance improvement considering only the convolutions in each model, and Model Speedup is the end-to-end inference performance improvement. The difference between the two is explained by each model's Convolution Time Share under ONNX-MLIR.
Deeper results and discussion can be found in the Further Information section at the end of the post.
Questions

1. This algorithm can be used for any convolution, but it was conceived and tested in a machine-learning context. Furthermore, it requires the NCHW tensor layout. What dialect level would be the best fit to implement it?
   a. Starting in the `linalg` dialect is a possibility, from the `linalg.conv_2d_nchw_fchw` operation.
2. The microkernel also needs to be generated by CSO.
   a. The `llvm.matrix.multiply.*` intrinsic can be used, but it would be a deviation from the MLIR ecosystem.
   b. Is there an MLIR-centric alternative for the microkernel? Perhaps a `matmul` microkernel or equivalent.
3. For Vector-Based Packing, should the vector size depend on the architecture? E.g., 4-element fp32 vectors for architectures with 128-bit vector registers.
   a. If so, what would be the best way to handle architectures with larger vector registers, such as Intel AVX-512?
4. CSA needs hardware information for tiling (cache sizes, etc.). Is there a standard MLIR data structure that contains that information?
   a. If so, how do I access it?
5. Discussion on any issues found and suggestions on how to better integrate this process with the current MLIR ecosystem (e.g. `linalg`) are welcome.
Further Information
A more detailed overview of the algorithm, with implementation snippets, can be found in the Open MLIR Meeting 2023-03-09 post, in both slide and video form. A full explanation of how it works, with more in-depth results and discussion, is available in our paper on the subject.