Optimizing convolutions for CPU execution

Hello,

Is there a good convolution optimization layer in MLIR, TF, or IREE? I’m thinking of tiling, copying, and unroll-and-jam for fast CPU execution (as opposed to calling a library implementation).

If not, is there a good reference on how to do this optimization, something like Uday @bondhugula’s tutorial on GEMM, or a good paper?

If not: under the classical NHWC/HWCF data layout, the problem does seem to reduce quite easily to an iteration over the filter’s spatial dimensions, where each step performs a matrix multiplication and accumulates the result into the output at the corresponding offset. Since the matrix multiplication is already covered by GEMM, things become simple, and this has the advantage of exploiting locality along the (innermost) channel dimension.
Is there something better to do than this (see the sketch below)?
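For concreteness, here is a minimal NumPy sketch of the reduction I have in mind (my own illustration, not MLIR/TF/IREE code): for each filter offset, the contribution is a (N*OH*OW, C) x (C, F) matmul accumulated into the output, assuming stride support and "VALID" padding.

```python
import numpy as np

def conv2d_nhwc_hwcf(x, w, stride=1):
    """Hypothetical reference helper: NHWC input, HWCF filter, VALID padding.
    Loops over the filter's spatial dims; each step is a matmul-accumulate."""
    n, h, wd, c = x.shape
    kh, kw, c2, f = w.shape
    assert c == c2
    oh = (h - kh) // stride + 1
    ow = (wd - kw) // stride + 1
    out = np.zeros((n, oh, ow, f), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # Shifted, strided input window: shape (n, oh, ow, c).
            patch = x[:, i : i + stride * oh : stride,
                         j : j + stride * ow : stride, :]
            # GEMM over the contiguous innermost channel dimension,
            # accumulated into the output.
            out += (patch.reshape(-1, c) @ w[i, j]).reshape(n, oh, ow, f)
    return out
```

Each of the kh*kw matmuls reads the channel dimension contiguously, which is where I expect the locality benefit to come from.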

Best,
D.

@antiagainst is landing options for a couple of such strategies, but that work isn’t quite ready yet.