[npcomp] Adding padding support for conv2d

We are looking at implementing padding support for lowering an aten.conv2d operation through tcf.conv2d_nchw and wanted to revisit our initial discussion about adding a pad operation to the reference backend.

Currently we use the following lowering pattern for basic conv support:

aten.conv2d(input, filter, bias, stride, padding, dilations, groups)
// --convert-aten-to-tcf
tcf.conv2d_nchw(input, filter, bias, stride, padding, dilations, groups)
// --convert-tcf-to-linalg
// bunch of shape-related error-handling
linalg.conv_2d_nchw(...)

Looking back at our original npcomp conv2d discussion, we had initially discussed adding a tcp.pad op (if padding != 0) that does a deep copy into a padded buffer with the correct dimensions and then hands the whole thing off to linalg.conv_* (that way the linalg ops only need to know how to handle VALID padding). So the lowering flow for a conv2d would now look like this, with a rough IR sketch after it.

tcf.conv2d_nchw(input, filter, bias, stride, padding, dilations, groups)
// --convert-tcf-to-linalg
// if padding != 0
%0 = tcp.pad(input, padding, fill_value/*=0*/)
%1 = linalg.conv_2d_nchw(%0, ...)
// else
%0 = linalg.conv_2d_nchw(input, ...)
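
For the padded branch, the tensor-level IR might look something like the sketch below. The tcp.pad assembly format, the representation of the padding operand, and the exact named-op spelling are placeholders here, not a settled design:

// hypothetical IR: input is tensor<1x1x3x3xf32>, pad of 1 on each spatial border
%fill = constant 0.0 : f32
%padded = tcp.pad %input, %padding, %fill
    : (tensor<1x1x3x3xf32>, tensor<4xindex>, f32) -> tensor<1x1x5x5xf32>
// the conv then only ever sees VALID padding
%result = linalg.conv_2d_nchw(%padded, %filter, ...)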

With this approach, tcp.pad would lower to something like:

tcp.pad(input, padding, fill_value)
// --convert-tcp-to-std
// compute the shape of the padded buffer
%pad_buf = std.alloc(sizeof(padded_buf))
%1 = fill(%pad_buf, fill_value) // then copy the input into a subview at the right offset
%2 = linalg.conv_2d_nchw(%1, ...)

A few points for discussion:

  1. Does this still sound like a good first-pass approach?
  2. TCP is a mostly unused dialect currently; do we still think this is the right place to add a PadOp, or should we just add it to TCF and then switch over to an upstream linalg lowering when there is one?
  3. Will lowering a PadOp to our own implementation that allocs break our current usage of the upstream bufferization passes? It looks like the TCPBufferize logic has stuck around for the SplatOp, but maybe there's also a way to express the padding operation without having to allocate a constant pad buffer to move data into. Maybe we can reuse refbackrt::AllocMemRefOp?

Yup, I think this is still a good approach.

The pad op we want here is definitely at the TCP level of abstraction, since it doesn’t have any of the sort of dynamism that characterizes TCF. The TCP dialect is just our place to put such TCP-level ops that don’t have a better home (the other candidates being linalg or std elementwise ops).

We definitely want Linalg to know how to handle padding better, but it currently doesn’t have an out-of-the-box solution, so something naive like this is the obvious step (if it becomes a performance problem, that can fuel a discussion for getting this better supported in linalg).

The transformation you refer to that converts tcp.pad : (tensor, i32, f32) -> tensor to

%buf = alloc(...)
fill(%buf, %fill_value)
%unpadded_region = subview %buf[...]
copy(%input, %unpadded_region)

is precisely what will live in TCPBufferize. So it fits perfectly with bufferization.
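
Concretely, for a small static case the bufferized code might look roughly like the following. This is just a sketch against the std/linalg ops we currently have; the strided layout of the subview result is abbreviated as #inner:

// pad a 1x1x3x3 input by 1 on each spatial border into a 1x1x5x5 buffer
%cst = constant 0.0 : f32
%buf = alloc() : memref<1x1x5x5xf32>
linalg.fill(%buf, %cst) : memref<1x1x5x5xf32>, f32
// #inner is the strided layout produced by the subview (elided here)
%inner = subview %buf[0, 0, 1, 1] [1, 1, 3, 3] [1, 1, 1, 1]
    : memref<1x1x5x5xf32> to memref<1x1x3x3xf32, #inner>
linalg.copy(%input, %inner) : memref<1x1x3x3xf32>, memref<1x1x3x3xf32, #inner>

The dynamic-shape version would first compute the padded sizes (each input dim plus its expansions) and feed them to the alloc.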


I have a working rough draft of the TCP pad op in my NPCOMP fork; it very much follows Sean’s suggested transformation. It’s currently written specifically for the NCHW format, so I am now making it format-agnostic using this I/O specification, with a worked example after the list:
- in: [D1, D2, …, DN]
- expand: [[L1, R1], [L2, R2], …, [LN, RN]]
- fill: scalar
- result: [D1+L1+R1, D2+L2+R2, …, DN+LN+RN]
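
As a concrete instance, a pad of 1 on each spatial border of a 1x1x3x3 NCHW input would read:
- in: [1, 1, 3, 3]
- expand: [[0, 0], [0, 0], [1, 1], [1, 1]]
- fill: 0.0
- result: [1, 1, 5, 5]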

Sounds good to me! Indeed, padding is a property of a tensor independent of concepts like NCHW.


Updated my fork, add-tcf-pad branch. Here’s the current synopsis, with a rough usage sketch after it. I split the expansion operand into two to simplify the pad’s bufferization code.

Pads a tensor with `fillVal` along the borders of each dimension according to `lowerExpansion` and `upperExpansion`. Note that this op is unmanaged, meaning that it assumes its operands and their shapes are valid.
The tensors have dimensions:
- operand:   [D1, D2, ..., DN]
- lowerExpansion: [L1, L2, ..., LN]
- upperExpansion: [U1, U2, ..., UN]
- fillVal:   scalar
- result:    [D1+L1+U1, D2+L2+U2, ..., DN+LN+UN]
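
For illustration, a use of the op might read roughly as follows; the actual assembly format is whatever the branch defines, so the operand order and types here are assumptions:

// lowerExpansion = upperExpansion = [0, 0, 1, 1]: pad each spatial border by 1
%result = tcp.pad %operand, %lowerExpansion, %upperExpansion, %fillVal
    : (tensor<1x1x3x3xf32>, tensor<4xindex>, tensor<4xindex>, f32) -> tensor<1x1x5x5xf32>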

The pad op’s test currently “fails”, but the op’s output values and shape confirm that it works.

https://github.com/llvm/mlir-npcomp/pull/148

Driving by, I noticed a linalg.pad_tensor op just landed: https://github.com/llvm/llvm-project/commit/16d4bbef30a9e625e04653047759d5636f9e58a5. Is this related?

@mikeurbach thanks for driving by :slight_smile:

It’s exciting that the linalg.pad_tensor op is landing upstream; I think we will probably want to switch over to it in the near(ish) future.

Our main goal is to run an end-to-end convolution through PyTorch with a Python test similar to this mul_e2e test.

We can rely on this tcp.pad implementation in the meantime while we push to get a PyTorch conv through the flow, and then move over to linalg.pad_tensor once the lights are turned on (and we’ve integrated with upstream LLVM so that the linalg.pad_tensor op exists in the npcomp world).

There are still a few other ML-framework convolution operands that we need to flesh out (stride, dilation, groups), but we will just assume stride=1, dilation=1, groups=1 for now to keep the ball moving.

That’s kind of what I figured, thanks for clarifying. Looking forward to seeing the end-to-end pipeline shape up.
