Introduction
Convolution is a central operation in machine-learning applications. We developed a new code-generation approach to computing convolutions on CPUs that fits into the MLIR ecosystem and improves performance compared to the common Im2Col + GEMM approach. This approach was presented in the Open MLIR Meeting of 2023-03-09, and this post has the objective of getting feedback on both the general viability of upstreaming it and some specific questions.
Compilation Flow
Our algorithm is composed of three compilation passes with different purposes:
- Convolution Slicing Analysis (CSA): an analysis pass responsible for determining a suitable tiling strategy for each convolution and the target hardware.
- Convolution Slicing Optimization (CSO): a lowering pass that generates an optimized implementation of a convolution using CSA's tiling strategy.
- Filter Packing (optional): if the convolution filters are static and available at compile time, this data-transformation pass can rearrange the filter data ahead of time to save execution time.
The following image shows the compilation flow for a machine-learning application. The implementation generated by CSO includes a cache-tiling loop nest (macrokernel), an input-tensor packing routine, and an optimized architecture-specific microkernel.
Convolution Slicing Analysis (CSA)
The analysis pass determines the tiling strategy based on hardware and convolution information. This tiling strategy is composed of a tile size, a tile distribution over the cache hierarchy, and a scheduling decision.
In CSA, both the input tensor and the filter set are tiled along the channel dimension; the microkernel information determines the tiling of the other dimensions. CSA thus calculates the number of channels in each tile based on cache information, so that one tile of each type (filter, input, and output) fits in the L1 cache simultaneously.
The tile distribution over the cache hierarchy is the number of tiles that are stored at once in each cache level, to maximize reuse between tiles that have already been loaded to cache. This is enforced by the loading order in the macrokernel. Both tile size and the distribution values are calculated using a heuristic.
The scheduling decision between Input Stationary and Weight Stationary refers to which tile type (input or filter) is stored in each cache level (L2 and L3). This is determined by a cost model that takes into account all data movement during the convolution process.
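As a rough illustration of the tile-size heuristic described above, here is a minimal Python sketch. The function name, parameters, and the example cache/microkernel sizes are all illustrative assumptions, not the actual pass's API: it simply finds the largest channel count such that one input tile, one filter tile, and one output tile fit in L1 together.

```python
# Hedged sketch of CSA's channel tile-size calculation (illustrative
# names; not the real pass). Pick the largest channel depth c so that
# one input tile, one filter tile, and one output tile fit in L1.

def csa_channels_per_tile(l1_bytes, n_win, n_filt, window_elems, elem_bytes=4):
    """Largest c such that
    input tile  (n_win  * window_elems * c * elem_bytes) +
    filter tile (n_filt * window_elems * c * elem_bytes) +
    output tile (n_win  * n_filt * elem_bytes)  <=  l1_bytes."""
    output_bytes = n_win * n_filt * elem_bytes          # independent of c
    per_channel = (n_win + n_filt) * window_elems * elem_bytes
    c = (l1_bytes - output_bytes) // per_channel
    return max(c, 1)

# Example: 32 KiB L1, a microkernel computing 8 windows x 8 filters,
# a 3x3 spatial window (9 elements per channel), fp32 data.
print(csa_channels_per_tile(32 * 1024, 8, 8, 9))  # -> 56
```

The real heuristic also has to produce the tile distribution over L2/L3 and feed the Input-/Weight-Stationary cost model; this sketch only covers the L1 fitting constraint.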
A fully detailed explanation can be found in the Further Information section at the end of this post.
Convolution Slicing Optimization (CSO)
The lowering pass is responsible for generating the optimized implementation using the CSA-calculated tiling values. The resulting MLIR code is divided into three parts: Loop Nest, Microkernel, and Packing. As in optimized tiled GEMM implementations, the microkernel and packing routines are architecture-specific and deeply connected.
Macrokernel Loop Nest
The cache-tiling macrokernel iterates over all tiles in an order that respects the CSA-determined tile distribution over the cache hierarchy.
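The loop structure can be sketched in plain Python (illustrative only; the actual output is MLIR). Here a Weight-Stationary order is assumed: each filter tile stays resident while input tiles stream past it, and partial outputs accumulate across channel tiles.

```python
# Minimal sketch of the cache-tiling macrokernel loop nest, assuming a
# Weight-Stationary schedule. input_tiles[c][i] and filter_tiles[c][f]
# hold the channel-tile-c slices; outputs accumulate over channel tiles.
def macro_kernel(input_tiles, filter_tiles, microkernel):
    n_c = len(input_tiles)              # number of channel tiles
    n_i = len(input_tiles[0])           # input tiles per channel tile
    n_f = len(filter_tiles[0])          # filter tiles per channel tile
    out = [[0.0] * n_f for _ in range(n_i)]
    for c in range(n_c):                # channel-tile loop
        for f in range(n_f):            # filter tile stays resident ("stationary")
            for i in range(n_i):        # input tiles stream through the cache
                out[i][f] += microkernel(input_tiles[c][i], filter_tiles[c][f])
    return out

# Toy microkernel stand-in: dot product of two flat tiles.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(macro_kernel([[[1, 2], [3, 4]]], [[[1, 1]]], dot))  # [[3.0], [7.0]]
```

An Input-Stationary schedule would simply swap the `f` and `i` loops; which order wins is exactly what the CSA cost model decides.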
Microkernel
The microkernel is responsible for the arithmetic operations that form the convolution, computing a single output tile from a filter-set tile and an input-tensor tile. The best way of performing these operations varies with what is available on the target architecture (e.g., larger vector registers or a hardware-implemented outer product), so it should be lowered separately.
Our prototype uses an outer-product-based microkernel, but the algorithm is free to use whatever is best for each architecture. The outer product is a useful operation because it is a high-density rank-1 update, producing n² outputs from 2n inputs, and is being incorporated into new architectures as ISA extensions such as IBM POWER10 MMA and Apple AMX. When unavailable in hardware, outer products can be emulated on any architecture with SIMD support.
A matrix-multiplication microkernel is often the best way to compute a series of multiplications and accumulations on the target architecture, and the convolution packing repurposes it with no modifications. Therefore, the same microkernel can be used when lowering matmul operations and is a good candidate to be separated into its own operation. This is another advantage of our approach that can help lowering to different architectures.
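To make the rank-1-update structure concrete, here is a hedged Python sketch of an outer-product-based microkernel (the real one is architecture-specific, e.g. POWER10 MMA; all names here are illustrative). Each packed step supplies n_win input values and n_filt filter values and performs one outer-product accumulation into an n_win × n_filt block of "register" accumulators.

```python
# Sketch of an outer-product-based microkernel (illustrative, scalar
# Python; real versions use vector/matrix ISA extensions).
def outer_product_microkernel(packed_input, packed_filter, n_win, n_filt):
    # packed_input:  k groups of n_win values (same index across windows)
    # packed_filter: k groups of n_filt values (same index across filters)
    k = len(packed_input) // n_win
    acc = [[0.0] * n_filt for _ in range(n_win)]
    for step in range(k):
        a = packed_input[step * n_win:(step + 1) * n_win]
        b = packed_filter[step * n_filt:(step + 1) * n_filt]
        for i in range(n_win):          # rank-1 update: acc += outer(a, b)
            for j in range(n_filt):
                acc[i][j] += a[i] * b[j]
    return acc

# 2 windows x 2 filters, a single packed step.
print(outer_product_microkernel([1, 2], [3, 4], 2, 2))  # [[3.0, 4.0], [6.0, 8.0]]
```

Note that nothing in the loop body is convolution-specific: fed GEMM-packed panels instead, the same kernel computes a matmul block, which is the reuse argument made above.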
In-depth descriptions of the microkernel's operation and access pattern can be found in the Further Information section at the end of this post.
Packing
The packing routine is the connection between the tiling process and the microkernel. It extracts a tile from the original input tensor or filter set and rearranges the data so that the microkernel can read it sequentially from memory. In the process, the data is loaded into cache. This connection to the microkernel means the packing layout is also architecture-specific.
However, outer-product-based microkernels have similar access patterns, which often leads to the following packing layout:
Each input-tensor tile contains multiple (N_{win}) windows (projections of a filter onto the input), and each filter-set tile contains multiple filters. In both cases, the elements at the same index in every window/filter are packed sequentially.
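The interleaved layout just described can be sketched in a few lines of Python (illustrative only; the generated code operates on memrefs, not lists):

```python
# Sketch of the packing layout: elements with the same index in every
# window (or filter) of a tile are laid out contiguously, so the
# microkernel reads one sequential group per outer-product step.
def pack_tiles(tiles):
    # tiles: list of equally sized, flattened windows or filters
    n_elems = len(tiles[0])
    packed = []
    for idx in range(n_elems):          # same index across all tiles...
        for t in tiles:                 # ...packed sequentially
            packed.append(t[idx])
    return packed

# Two 3-element windows -> interleaved layout.
print(pack_tiles([[1, 2, 3], [4, 5, 6]]))  # [1, 4, 2, 5, 3, 6]
```

Each contiguous group of N_{win} (or N_{filt}) values is exactly one operand vector of the outer-product microkernel.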
In a machine-learning context, the filters may be static and part of the compilation process. Therefore, the filter-packing step can be done as a data-transformation pass, as previously stated.
Vector-Based Input Packing
As evidenced by the previous image, packing the input tensor for an outer-product-based microkernel requires expanding the input tensor into windows, so that the microkernel can compute a block of output elements at once.
In convolutions with stride 1, windows overlap, so data is replicated during packing. To avoid reading the same data multiple times, one insight can be used: excluding edge cases, the next element of the current window is the current element of the next window. Given the established packing layout, a set of N_{win} elements can be shifted left with a serial-in element to obtain the following set. If this is done at the register level, the data is read only once.
Each instruction set might achieve this result with a different combination of instructions, but the Vector dialect allows for a generic implementation using the shuffle operation to extract elements from two vectors.

```mlir
%5 = vector.shuffle %3, %4 [1, 2, 3, 4] : vector<4xf32>, vector<4xf32>
```
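A plain-Python stand-in for this register-level shift (illustrative only): the next packed group is the current group shifted left by one with a single new element serialized in, which is what the shuffle with indices [1, 2, 3, 4] achieves on two 4-wide vectors.

```python
# Sketch of stride-1 vector-based packing: each input value is read once,
# and subsequent groups are derived by a shift with serial-in (emulating
# vector.shuffle %cur, %new [1, 2, 3, 4] on 4-element vectors).
def next_group(current_group, serial_in):
    return current_group[1:] + [serial_in]

row = [10, 20, 30, 40, 50]             # one input row, N_win = 4
group = row[0:4]                       # windows' element 0: [10, 20, 30, 40]
group = next_group(group, row[4])      # windows' element 1: [20, 30, 40, 50]
print(group)  # [20, 30, 40, 50]
```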
Current Prototype
Our approach is implemented in the ONNX-MLIR machine-learning framework (not upstreamed). The Conv operation of the ONNX dialect is lowered by CSO to the Affine, MemRef, Vector, and Arith dialects. The microkernel is currently implemented as an external library.
Experimental Results
The developed prototype was tested with full machine-learning model inference in the ONNX-MLIR framework against the traditional Im2Col + GEMM method using the OpenBLAS GEMM routine. Every result is the average of 100 measurements, with low variability, on an IBM POWER10 E1050 (P10) and an Intel Xeon Silver 4208 CPU (x86).
Convolution Speedup is the performance improvement considering only the convolutions in each model, and Model Speedup is the end-to-end inference performance improvement. The difference between the two is explained by each model's Convolution Time Share under ONNX-MLIR.
Deeper results and discussion can be found in the Further Information section at the end of the post.
Questions

1. This algorithm can be used for any convolution, but it was conceived and tested in a machine-learning context. Furthermore, it requires the NCHW tensor layout. What dialect level would be the best fit to implement it?
   a. Starting in the `linalg` dialect is a possibility, from the `linalg.conv_2d_nchw_fchw` operation.
2. The microkernel also needs to be generated by CSO.
   a. The `llvm.matrix.multiply.*` intrinsic can be used, but it would be a deviation from the MLIR ecosystem.
   b. Is there an MLIR-centric alternative for the microkernel? Perhaps a `matmul` microkernel or equivalent.
3. For Vector-Based Packing, should the vector size depend on the architecture? E.g., 4-element fp32 vectors for architectures with 128-bit vector registers.
   a. If so, what would be the best way to handle architectures with larger vector registers, such as Intel AVX-512?
4. CSA needs hardware information for tiling (cache sizes, etc.). Is there a standard MLIR data structure that contains that information?
   a. If so, how do I access it?
5. Discussion on any issues found and suggestions on how to better integrate this process with the current MLIR ecosystem (e.g. `linalg`) are welcome.
Further Information
A more detailed overview of the algorithm, with implementation snippets, can be found in the Open MLIR Meeting 2023-03-09 post, in both slide and video form. A full explanation of how it works, with more in-depth results and discussion, is available in our paper on the subject.