What is the high-level rationale of LinalgPromotionPass?

Hi all,
I am confused by the Promotion pass in MLIR. I don't understand the high-level concepts behind promotion, and I can hardly find any material on it. Could someone help me understand what promotion is?
Thanks.

The idea of this transformation is to move the inputs and/or outputs of a linalg op into their own allocations.
In general it is used on a linalg op that has been tiled, which allows moving a tile of the op into a contiguous piece of memory.
There are several cases where this is useful. For large operations, having the whole tile in contiguous memory allows closer memory accesses. It is also useful for targets with several kinds of memory running at different speeds (for instance, GPUs tend to have shared memory with higher bandwidth than main memory): promotion allows DMA-ing a tile of data into the faster memory, where it can then be loaded several times at high bandwidth.
Another reason it can be useful is to handle padding on the fly: since a new allocation is created, the tile can be padded out to the full tile size in order to avoid border effects.
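
To make this concrete, here is a minimal sketch of driving the transformation from C++, assuming the upstream `linalg::promoteSubViews` entry point and `linalg::LinalgPromotionOptions` (names as in recent MLIR; `promoteTiledOp` itself is a hypothetical helper):

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

using namespace mlir;

// Hypothetical helper: promote all operands of an already-tiled linalg op.
// Conceptually, the rewrite turns
//   linalg.matmul ins(%svA, %svB) outs(%svC)   // %sv* are tiled subviews
// into
//   %bufA = memref.alloc(...)   // fresh contiguous tile buffer (B, C alike)
//   <copy %svA into %bufA>      // copy in
//   linalg.matmul ins(%bufA, %bufB) outs(%bufC)
//   <copy %bufC back into %svC> // copy out, for outputs only
//   memref.dealloc %bufA ...
FailureOr<linalg::LinalgOp> promoteTiledOp(OpBuilder &builder,
                                           linalg::LinalgOp tiledOp) {
  OpBuilder::InsertionGuard guard(builder);
  builder.setInsertionPoint(tiledOp);
  // Default options promote every input and output operand.
  linalg::LinalgPromotionOptions options;
  return linalg::promoteSubViews(builder, tiledOp, options);
}
```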

The unit test for this transform should help you understand what the transformation looks like in practice:

Thanks, this helps me a lot!
It is a perfect introduction to the promotion concept.

Hi Thomas,
"It is also useful for targets with several kinds of memory running at different speeds (for instance, GPUs tend to have shared memory with higher bandwidth than main memory): promotion allows DMA-ing a tile of data into the faster memory, where it can then be loaded several times at high bandwidth…"
Is there an implementation you could kindly point to, so I can look at how this gets done?

I'm also confused here. It looks like there is no implementation of a memory hierarchy in MLIR, such as the shared memory and global memory of a GPU.

I don't believe there are such examples in MLIR core; however, in IREE we do use promotion to take advantage of the faster shared memory of some GPUs.
Here is an example where we do that for matmul code generation:

The only thing specific to GPUs in this case is passing a lambda, allocateWorkgroupMemory, to create the shared-memory allocation with the right address space. Then we just use the promotion pattern as is.
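
For illustration, here is a minimal sketch of what such an allocation callback could look like, assuming the `AllocBufferCallbackFn` signature and the GPU dialect's workgroup address-space attribute from recent upstream MLIR (this `allocateWorkgroupMemory` is a hypothetical stand-in, not IREE's actual code):

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/IR/TypeUtilities.h"
#include "mlir/Interfaces/DataLayoutInterfaces.h"

using namespace mlir;

// Hypothetical allocation callback: place the promoted tile buffer in GPU
// workgroup (shared) memory instead of a default function-local allocation.
static std::optional<Value>
allocateWorkgroupMemory(OpBuilder &builder, memref::SubViewOp subview,
                        ArrayRef<Value> boundingSubViewSize,
                        DataLayout &layout) {
  // This simplified sketch only handles statically-sized tiles.
  SmallVector<int64_t> shape;
  for (Value size : boundingSubViewSize) {
    auto cst = size.getDefiningOp<arith::ConstantIndexOp>();
    if (!cst)
      return std::nullopt;
    shape.push_back(cst.value());
  }
  // Tag the allocation with the workgroup address space so it lowers to
  // shared memory on GPU targets.
  auto wgSpace = gpu::AddressSpaceAttr::get(builder.getContext(),
                                            gpu::AddressSpace::Workgroup);
  auto type = MemRefType::get(shape, getElementTypeOrSelf(subview.getType()),
                              MemRefLayoutAttrInterface{}, wgSpace);
  return builder.create<memref::AllocOp>(subview.getLoc(), type).getResult();
}
```

Such a callback would then be plugged in via LinalgPromotionOptions::setAllocationDeallocationFns, as sketched further down the thread.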

Thanks for the explanation.

Thanks for the pointer.
So essentially the pass provides callbacks to register allocation/deallocation functions and a copy function, which a hardware-specific driver can use to create a new buffer in a faster memory of its choice, but it again comes with the constraint of copying the data from the original buffers to the temporary buffers and back.
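
As a hedged sketch of that registration (the callback type aliases AllocBufferCallbackFn, DeallocBufferCallbackFn, and CopyCallbackFn follow recent upstream MLIR and may differ across versions; the copy bodies here simply emit memref.copy):

```cpp
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

using namespace mlir;

// Hypothetical wiring of the alloc/dealloc and copy-in/copy-out callbacks
// into the promotion options.
linalg::LinalgPromotionOptions makePromotionOptions() {
  linalg::LinalgPromotionOptions options;
  options
      .setAllocationDeallocationFns(
          /*allocationFn=*/allocateWorkgroupMemory, // e.g. the sketch above
          /*deallocationFn=*/
          [](OpBuilder &b, Value buffer) -> LogicalResult {
            // Workgroup memory needs no explicit dealloc in this sketch.
            return success();
          })
      .setCopyInOutFns(
          /*copyInFn=*/
          [](OpBuilder &b, Value src, Value dst) -> LogicalResult {
            b.create<memref::CopyOp>(src.getLoc(), src, dst);
            return success();
          },
          /*copyOutFn=*/
          [](OpBuilder &b, Value src, Value dst) -> LogicalResult {
            b.create<memref::CopyOp>(src.getLoc(), src, dst);
            return success();
          });
  return options;
}
```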

Correct. Note that you can also decide to promote only the inputs, in which case no buffer is allocated for the output and you don't need a copy back. This is typically what you want for matmul/convolution kinds of cases, where the inputs are read many times but the output is written only once.
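
A minimal sketch of that input-only configuration, assuming setOperandsToPromote takes operand positions (0 and 1 being the two matmul inputs):

```cpp
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

using namespace mlir;

// Hypothetical: promote only the two input operands of a tiled matmul.
// No buffer is allocated for the output, so no copy back is emitted.
FailureOr<linalg::LinalgOp> promoteInputsOnly(OpBuilder &builder,
                                              linalg::LinalgOp tiledMatmul) {
  linalg::LinalgPromotionOptions options;
  options.setOperandsToPromote({0, 1}); // the output operand stays in place
  return linalg::promoteSubViews(builder, tiledMatmul, options);
}
```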