[RFC] Optimized Convolution for MLIR

Introduction

Convolution is a key operation in machine-learning applications. We developed a new code-generation approach for computing convolutions on CPUs that fits into the MLIR ecosystem and improves performance compared to the common Im2Col + GEMM approach. This approach was presented in the Open MLIR Meeting 2023-03-09, and the objective of this post is to get feedback on both the general viability of upstreaming it and on specific questions.

Compilation Flow

Our algorithm is composed of three compilation passes with different purposes:

  1. Convolution Slicing Analysis (CSA): an analysis pass that determines a suitable tiling strategy for each convolution on the target hardware.
  2. Convolution Slicing Optimization (CSO): a lowering pass that generates an optimized implementation of each convolution using CSA’s tiling strategy.
  3. Filter Packing (optional): if the convolution filters are static and available at compile time, this data-transformation pass rearranges the filter data ahead of time, saving that work at runtime.

The following image shows the compilation flow for a machine-learning application. The implementation generated by CSO includes a cache-tiling loop nest (macro-kernel), an input-tensor packing routine and an optimized architecture-specific micro-kernel.

Convolution Slicing Analysis (CSA)

The analysis pass determines the tiling strategy based on hardware and convolution information. This tiling strategy is composed of a tile size, a tile distribution over the cache hierarchy, and a scheduling decision.

In CSA, both the input tensor and the filter set are tiled along the channel dimension; the micro-kernel dimensions determine the tile sizes in the other dimensions. CSA then calculates the number of channels in each tile from cache information, so that one tile of each type (filter, input, and output) fits in the L1 cache simultaneously.
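
As a rough sketch of the L1 constraint (our own notation here, simplified from the actual heuristic): let s be the element size in bytes, Kh x Kw the filter spatial size, Nwin and Nf the number of windows and filters handled by the micro-kernel at once, and C_t the number of channels per tile. CSA then picks the largest C_t such that

  s * C_t * Kh * Kw * (Nwin + Nf) + s * Nwin * Nf <= L1 capacity

that is, one packed input tile, one packed filter tile, and the corresponding output tile fit in L1 together.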

The tile distribution over the cache hierarchy is the number of tiles kept in each cache level at once, chosen to maximize reuse of tiles that have already been loaded into cache. It is enforced by the loading order in the macro-kernel. Both the tile size and the distribution values are calculated by a heuristic.

The scheduling decision between Input Stationary and Weight Stationary refers to which tile type (input or filter) is stored in each cache level (L2 and L3). This is determined by a cost model that takes into account all data movement during the convolution process.

The full detailed explanation can be found in the Further Information section at the end of this post.

Convolution Slicing Optimization (CSO)

The lowering pass is responsible for generating the optimized implementation using the CSA-calculated tiling values. The resulting MLIR code is divided into three parts: Loop Nest, Micro-Kernel and Packing. Similar to optimized tiled GEMM implementations, the micro-kernel and packing routines are architecture-specific and deeply connected.

Macro-Kernel Loop Nest

The macro-kernel is a cache-tiling loop nest that iterates over all tiles in an order that respects the CSA-determined tile distribution over the cache hierarchy.
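
To give an idea of the shape of this loop nest, here is a heavily simplified MLIR sketch (hypothetical function and value names, placeholder tile counts, and an Input Stationary schedule assumed purely for illustration); the channel-block loop, the packing code and the micro-kernel call are elided:

// Hedged sketch only: the rough loop structure, not the exact IR emitted by CSO.
func.func @macro_kernel_shape() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %n_input_tiles = arith.constant 16 : index   // placeholder: CSA-computed input tile count
  %n_filter_tiles = arith.constant 8 : index   // placeholder: CSA-computed filter tile count
  scf.for %it = %c0 to %n_input_tiles step %c1 {
    // pack input tile %it; under Input Stationary it stays resident in the
    // upper cache levels while filter tiles stream through L1
    scf.for %ft = %c0 to %n_filter_tiles step %c1 {
      // pack filter tile %ft (or load the pre-packed static filters) and
      // invoke the micro-kernel to produce one output tile
    }
  }
  return
}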

Micro-Kernel

The micro-kernel is responsible for the arithmetic operations that form the convolution, computing a single output tile from a filter set tile and an input tensor tile. The best way of performing these operations varies with what is available on the target architecture (e.g. wider vector registers, a hardware-implemented outer product), so the micro-kernel should be lowered separately.

Our prototype uses an outer-product-based micro-kernel, but the algorithm is free to use whatever is best for each architecture. The outer product is a useful operation because it is a high-density rank-1 update, producing n² outputs from 2n inputs, and it is being incorporated into new architectures through ISA extensions such as IBM POWER10 MMA and Apple AMX. When unavailable in hardware, outer products can be emulated on any architecture with SIMD support.
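
To make this concrete, below is a minimal sketch (illustrative 4x4 size, not our actual micro-kernel) of a single rank-1 update expressed with the upstream Vector dialect. In practice the micro-kernel chains many such updates over a packed tile, keeping the accumulator in registers.

// Hedged sketch: one rank-1 update of the kind an outer-product micro-kernel performs.
func.func @rank1_update(%a: vector<4xf32>, %b: vector<4xf32>,
                        %acc: vector<4x4xf32>) -> vector<4x4xf32> {
  // %acc += outer(%a, %b): 16 partial outputs produced from 8 inputs
  %0 = vector.outerproduct %a, %b, %acc : vector<4xf32>, vector<4xf32>
  return %0 : vector<4x4xf32>
}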

A matrix-multiplication micro-kernel often already embodies the best way to compute a series of multiply-accumulates on the target architecture, and the convolution packing repurposes it without modification. The same micro-kernel can therefore also serve matmul lowering, which makes it a good candidate for being separated into its own operation. This is another advantage of our approach that can ease lowering to different architectures.

In-depth descriptions of the micro-kernel’s operation and access pattern can be found in the Further Information section at the end of this post.

Packing

The packing routine is the connection between the tiling process and the micro-kernel. It extracts a tile from the original input tensor or filter set and rearranges the data so that the micro-kernel can read it sequentially from memory; in the process, the data is loaded into cache. Because of this tight connection to the micro-kernel, the packing layout is also architecture-specific.

However, outer-product-based micro-kernels have similar access patterns, which often leads to the following packing layout:

Each input tensor tile contains multiple (Nwin) windows (projections of a filter onto the input) and each filter set tile contains multiple filters. In both cases, the elements from the same index in every window/filter are packed sequentially.
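
As a concrete illustration (our own shorthand, with Nwin = 4 and w_i[e] denoting element index e of window i), the packed input tile is laid out as

  w0[0] w1[0] w2[0] w3[0]   w0[1] w1[1] w2[1] w3[1]   w0[2] w1[2] ...

and the packed filter tile follows the same pattern, with filters in place of windows.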

In a machine-learning context, the filters may be static and part of the compilation process. Therefore, the filter packing step can be done as a data transformation pass, as previously stated.

Vector-Based Input Packing

As evidenced by the previous image, packing the input tensor for an outer-product-based micro-kernel requires an expansion of the input tensor into windows, so that the micro-kernel may compute a block of output elements at once.

In convolutions with stride 1, windows share overlapping elements, so packing replicates data. To avoid reading the same data multiple times, one insight can be used: excluding edge cases, the next element of the current window is the current element of the next window. Given the established packing layout, a set of Nwin elements can therefore be shifted to the left with a serial-in element to obtain the following set. If this is done at the register level, the data is only read from memory once.

[Image: register-level left shift with a serial-in element, producing the next packed set]

Each instruction set might achieve this result with a different combination of instructions, but the Vector dialect allows for a generic implementation using the shuffle operation to extract elements from 2 vectors.

%5 = vector.shuffle %3, %4 [1, 2, 3, 4] : vector<4xf32>, vector<4xf32>
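
In this snippet, mask values 0 to 3 select elements from the first operand and 4 to 7 from the second. Assuming %3 holds the current set of Nwin = 4 packed elements and the first element of %4 is the serial-in element, %5 is the following set, produced without rereading the elements shared between the two sets.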

Current Prototype

Our approach is implemented in the ONNX-MLIR machine-learning framework (not upstreamed). The Conv operation of the ONNX dialect is lowered to the Affine, Memref, Vector and Arith dialects by CSO. The micro-kernel is currently implemented as an external library.

Experimental Results

The developed prototype was tested with full machine-learning model inference in the ONNX-MLIR framework against the traditional Im2Col + GEMM method with the OpenBLAS GEMM routine. Every result is the average of 100 measurements, with low variability, on an IBM POWER10 E1050 (P10) and an Intel Xeon Silver 4208 CPU (x86).

Convolution Speedup is the performance improvement considering the convolutions in each model, and Model Speedup is the end-to-end inference performance improvement. The difference between both results is explained by the Convolution Time Share of each model using ONNX-MLIR.

Deeper results and discussion can be found in the Further Information section at the end of the post.

Questions

  1. This algorithm can be used for any convolution, but was conceived and tested in a machine-learning context. Furthermore, it requires NCHW tensor layout. What dialect level would be the best fit to implement it?
    a. Starting in the linalg dialect is a possibility, from the linalg.conv_2d_nchw_fchw operation.

  2. The micro-kernel also needs to be generated by CSO.
    a. The llvm.matrix.multiply.* intrinsic can be used, but it would be a deviation from the MLIR ecosystem.
    b. Is there an MLIR-centric alternative for the micro-kernel? Perhaps a matmul micro-kernel or equivalent.

  3. For Vector-Based Packing, should the vector size be dependent on the architecture? E.g. 4-element fp32 vectors for 128-bit vector register architectures.
    a. If so, what would be the best way to handle architectures with larger vector registers such as Intel AVX-512?

  4. CSA needs hardware information for tiling (cache sizes, etc). Is there a standard MLIR data structure that contains that information?
    a. If so, how do I access it?

Discussion on any issues found and suggestions on how to better integrate this process with the current MLIR ecosystem (e.g. linalg) are welcome.

Further Information

A more detailed overview of the algorithm and implementation snippets can be found in the Open MLIR Meeting 2023-03-09 post, in both slide and video form. A full explanation on how it works, more in-depth results and discussion are available in our paper on the subject.


Thanks for your great ODM and for sharing your work.
This is quite timely and I think there is a good opportunity to integrate this work in MLIR; I am looking forward to it.

We have specifically refrained from building cost models and advanced heuristics upstream until now, focusing our efforts on abstractions and composability. So there is lots of room for improvement :slight_smile: It would be great to see your existing (or proposed) implementation for this layer.

We have quite some abstractions and transformations to automate most of this process once the strategy is determined. I did a post recently about some possible e2e entry points in the system.
This entry point is tied to IREE, which comes with quite some advantages (e.g. we can just run on mobile or on GPU when ready), but @makslevental has also been working on making this more standalone with a nice Python API and lightweight execution engine (see the ODM from yesterday).

We could add another example targeting the convolutions you want and build the basic BLIS-like strategy to see what we are missing to achieve what you have (I’ll give a few cents in my next replies).

One of the aspects of the abstractions we built in MLIR is that operations often “resist” transformations and do not lose their “names”. Even when they do, they remain high-level enough that the name can be recovered quite easily (especially if we reduce to something that looks like a contraction). I expect you’ll find that the toolbox here meets most of your needs.

We have various kinds of support for these transformations, but details matter. Right now my intuition is that we are missing the exact packing form you want, but I expect it would not be far from a generalization of this recent commit.

The current implementation of vectorization for conv (starting at this commit and followups) ends up generating something similar, but it may not be exactly the form you are looking for. Happy to iterate on generalizing it until we get the form we need.

Most of the infra I described indeed starts from Linalg. There are other options too but I’ll be interested in getting the results from those abstractions to get started.

I was just extracting some transformations to reduce the opacity of the vector lowering decisions here. A linalg op gets vectorized to either vector.contract, vector.outerproduct or vector.fma operations depending on high-level scheduling decisions. vector.contract can then lower to vector.matmul, which goes to llvm.matrix.multiply. All these paths should be quite explicit in the PR I linked. However, again, details matter and all the connections may not happen properly atm, which gives us an opportunity to improve the system and the transformations.
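
For reference, here is roughly what a matmul-shaped contraction looks like at the vector level before those lowerings kick in (illustrative sizes and maps on my side, not necessarily the exact IR those paths produce):

// Illustrative only: a 4x4x4 matmul as vector.contract, the form that can lower to
// vector.matmul / llvm.matrix.multiply or be unrolled to vector.outerproduct / vector.fma.
#mm_trait = {
  indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                   affine_map<(m, n, k) -> (k, n)>,
                   affine_map<(m, n, k) -> (m, n)>],
  iterator_types = ["parallel", "parallel", "reduction"]
}
func.func @contract(%a: vector<4x4xf32>, %b: vector<4x4xf32>,
                    %c: vector<4x4xf32>) -> vector<4x4xf32> {
  %0 = vector.contract #mm_trait %a, %b, %c
       : vector<4x4xf32>, vector<4x4xf32> into vector<4x4xf32>
  return %0 : vector<4x4xf32>
}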

The way the transforms are built and connected with the transform dialect should make this mostly an implementation detail from the point of view of functionality, fully delegated to the performance model. For instance, we can tile to a size computed by a cost model to fit in the memory hierarchy and then “unroll the vectors” to a target-specific size while still being “retargetable” (i.e. the execution will be correct, but bad decisions will result in bad performance).

Not at this point; something that fits the bill will need to be built. We can start with a heuristic implementation driven by explicitly specified constraints injected with the transform dialect, to get started without overcommitting too early to a full hardware description model.

Happy to chat more and get a better idea of what we need to add to make the process easy.
