Padding for vectorization - is there support in MLIR yet?

Hello,

When vectorizing matrix multiplications or convolutions on 2D data, it can be useful to ensure that rows have a length that is a multiple of the machine vector width. This can be achieved through padding. Is there support for this in MLIR's code transformations?

Best,
Dumitru

At the moment there are three abstractions I know of:

  1. vector.transfer_read has “on-the-fly” padding and masking semantics (see the sketch after this list). Depending on your HW it may or may not be a good idea to “just use that”; in many cases you want to additionally amortize this padding.
  2. Linalg on buffers has promotion, which inserts padding, vectorizes to vector.transfer ops, and has some primitive dependence-analysis patterns to bypass round trips to memory and try to amortize the padding. This is not the preferred path.
  3. Linalg on tensors has tiling + padding + hoisting of padding (i.e. packing), which is in active development (see the linked post for more general context).
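
To make option 1 concrete, here is a minimal sketch (op spellings follow current upstream MLIR and may differ slightly across versions; %A, %i and %j are just illustrative names): vector.transfer_read takes an explicit padding value that fills any out-of-bounds lanes, so a row whose length does not divide the vector width can still be read as a full vector.

```mlir
// Read 8 contiguous f32 elements starting at (%i, %j); lanes that fall
// outside the memref bounds are filled with the padding value %pad.
func.func @padded_read(%A: memref<?x?xf32>, %i: index, %j: index) -> vector<8xf32> {
  %pad = arith.constant 0.0 : f32
  %v = vector.transfer_read %A[%i, %j], %pad : memref<?x?xf32>, vector<8xf32>
  return %v : vector<8xf32>
}
```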

I am sure a bunch of things are also being developed on top of affine but I haven’t seen them in-tree yet.


Side note: there is another potentially interesting transformation, performing this padding at the whole-graph level.

Starting from named linalg ops on tensors, it should be quite easy to create a global pass that:

  1. creates “neutral element tensors” whose sizes are the right multiples along each dimension,
  2. inserts the original tensors at the right positions with subtensor_insert (see the sketch after this list),
  3. just uses vector.transfer_read, which then canonicalizes away into a plain vector.load since we statically know that everything divides.
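
A rough sketch of steps 1 and 2, under the assumption of static shapes and using current op names (subtensor_insert has since been renamed tensor.insert_slice; the 13x13 / 16x16 shapes and the @pad_operand name are purely illustrative): the operand is embedded at offset (0, 0) into a larger tensor filled with the neutral element, so every dimension becomes a multiple of the 8-wide vector.

```mlir
// Step 1: build a "neutral element" tensor whose sizes (16x16) are the
// next multiples of the vector length (8) above the original sizes (13x13).
// Step 2: insert the original tensor at offset (0, 0); the extra rows and
// columns keep the neutral value.
func.func @pad_operand(%A: tensor<13x13xf32>) -> tensor<16x16xf32> {
  %zero = arith.constant 0.0 : f32
  %neutral = tensor.splat %zero : tensor<16x16xf32>
  %padded = tensor.insert_slice %A into %neutral[0, 0] [13, 13] [1, 1]
      : tensor<13x13xf32> into tensor<16x16xf32>
  return %padded : tensor<16x16xf32>
}
```

With all sizes statically dividing the vector length, the vector.transfer_read of step 3 needs no masking and can fold to a plain vector.load.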

The tradeoff is then about global memory consumption.
A less direct / more complicated way to get the same effect is padding + hoisting the padding maximally all the way out (which also reorders the elements of the n-D vectors into contiguous blocks of memory).

The advantage here is that the amount of hoisting (and the global memory increase) can be controlled: if the memory consumption of some op blows up, we can trade off between “memory-hoggy but amortized” and “on the fly”.

Lastly, there is also the big switch of a library approach: connecting MLIR to libxsmm, which Intel folks are pushing on. This will also connect nicely to named structured ops.


This is very nice work and I want to try it. There are two things here:

  • Padding and buffering, which I will look into in the following days.
  • In-place computation. Is this already functional? If yes, I could try it on a nice example I have: a signal-processing app using an FFT, already written in MLIR. Currently, the FFT implementation works on buffers (memrefs) in order to allow in-place computation. It would be nice, however, if I could convert this implementation to use tensors instead (while preserving the in-place behavior). Is there more documentation (and maybe some examples) on this?

The prototype for this is here: https://github.com/google/iree-llvm-sandbox/tree/main/runners/test
It has special build instructions that may be outdated by now, as we have set up our internal build and are mostly using that these days.

If you venture there, be aware that this only works on the examples for which there are tests, and expect rough edges for the next few weeks as things continue to be flushed and hardened.

A nice byproduct, though, is that it runs end-to-end from Python, and we will be starting simple search soon (internally for now).

I don’t expect your FFT to be written in Linalg, but it would be a nice experiment to write it functionally as a DFT with Linalg on tensors and have a special lowering / transformation able to iterate on “log-space loops”. This is still handwavy and I haven’t tried anything yet.

There will likely be a bunch of work spawning off from this on 1) scf ops + yield, 2) subview with dynamic strides, and 3) special vectorization stuff, but my rough intuition is that this should work out.

At the moment it’s just the prototype repo I linked above; we are in the process of slowly flushing bufferization, and any reviewing help is most welcome if you have cycles. Once enough things are flushed, it will be nice to put together a colab that walks through the composition of transformations; this is a few weeks out.

So if I understand correctly, you forked the project into a sandbox, and when it matures enough it will be upstreamed? Do you have an approximate timeline for this?

So the idea would be that, instead of creating new processing pipelines in C++, one will be able to do it in Python? This is great! BTW, does the Python interface allow you to define new dialects and passes, or only choose passes from a given palette?

It’s not even forked, just a few extra passes added and linked with regular MLIR, plus some Python experiments (that’s why the build procedure is a bit strange).

The flushing of bufferization has started (check my commits and Phabricator reviews); once this is done, the other pieces should be much simpler.