Estimating shape bounds for tensors and using worst-case amount of memory

Hi! I’m new to MLIR, but I was wondering about the level of support for dynamic shapes. I want to understand whether it is possible to build an NN inference engine on top of MLIR that supports dynamic shapes and can estimate an upper bound on memory consumption, and hence use a fixed amount of memory.
This is done, for example, in TensorRT, where the user provides bounds for shapes (and sometimes for values) of input tensors, and TensorRT does all shape inference itself. A built TensorRT engine can tell how much memory it will need in the worst case, and you can even provide it a pointer to preallocated memory of that size. I want to do a similar thing using MLIR.
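To make the goal concrete, here is a minimal sketch of the idea (in Python; this is not TensorRT’s actual API, and the function name and bound format are made up for illustration): given user-supplied upper bounds on each dynamic dimension, compute the worst-case byte size of a tensor, and hence a fixed allocation budget.

```python
def worst_case_bytes(shape, dim_bounds, elem_bytes=4):
    """Worst-case size of one tensor.

    shape uses None for dynamic dims; dim_bounds maps a dynamic dim's
    index to its user-supplied upper bound; elem_bytes is the element
    size (4 for f32).
    """
    total = elem_bytes
    for i, dim in enumerate(shape):
        total *= dim if dim is not None else dim_bounds[i]
    return total

# A tensor<?x128xf32> whose first dimension is bounded by 64:
budget = worst_case_bytes([None, 128], {0: 64})
print(budget)  # 64 * 128 * 4 = 32768 bytes
```

Summing such budgets over all intermediates (or over the peak of a liveness-aware schedule) gives the fixed preallocation size.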
Obviously I want to reuse MLIR infrastructure (the tensor, shape, and linalg dialects, bufferization, etc.). So the question is: are there any obstacles to implementing such a feature, given that I will implement shape bounds inference myself? Are all major dialects ready to work with such constraints?

This seems entirely in the realm of what can be done. The built-in tensor type supports an encoding attribute which could model bounds. The challenge is rather how to carry this through transformations, since the system hasn’t been built around this concept of bounded shapes.

Am I right that the most straightforward approach would probably be to write a pass that finds all ops that produce tensors and updates the bounds info on their outputs, and to run this pass before bufferization?

Have you seen the recent ValueBounds op interface work?

I think this is a little bit difficult to answer in general. It depends on how the information is used and serialized. I like having this be computed (where possible) as needed, so that one can then use it. Mehdi mentioned the tensor encoding attribute, but depending on the dialect and the previous inference/passes run on it, one may not be able to use it (at least not without clobbering other information).

There are quite a few options in this space. It can definitely be done (and the work on ValueBounds would be useful, so give it a look; it also incorporates external interfaces, which is good to know here). I prefer a more analysis-based, separable serialization approach (more op based) rather than encoding the bounds in the type. It also enables more local behavior, vs. needing a module pass since call sites may need updating. And then it can be made uniform across multiple dialects. But serializing in the type has more folks using it (which is both a pro and a con: a pro as a proof of concept, a con in that it may restrict one from doing the same thing).

I’m definitely going to check the progress on ValueBounds, thanks for the suggestion!

Can you explain a little bit more what you mean by an analysis and separable serialization approach?

Sure. Instead of always capturing the bounds in the IR, one first constructs an Analysis (in this case it could be a map from value & dimension index to a bound). So if you had

%10 = "foo.add"(%9, %11) : (tensor<10x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32>

you could have %10[1] -> 20 to indicate that we have “tensor<? x [? <= 20] x f32>”. During the analysis phase one manipulates this only in memory, without creating uniqued Attributes & Types with context-scoped lifetimes. This allows you to decide how to materialize it afterwards. One way would be to use the encoding attribute on the tensor type (but not all dialects will understand the encoding, and it might be dropped by passes). Another is to use ops (including a “classic” precondition/postcondition style, at the cost of potentially obstructing optimizations, though these could be made minimal and only inserted where bounds can’t be inferred anew). For the preallocated case, one would probably only retain the bounds until the memref level, and then take dynamic slices of / commit dynamic updates into the preallocated buffers.
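As a rough illustration of what such an in-memory analysis could look like (in Python; the map layout and transfer function are hypothetical, just mirroring the `foo.add` example above), bounds live in a side table keyed by (SSA value, dim index), and each op kind gets a transfer function that propagates them without touching the IR:

```python
from typing import Dict, List, Tuple

# (SSA value name, dim index) -> known upper bound for that dimension.
Bounds = Dict[Tuple[str, int], int]

def transfer_elementwise(bounds: Bounds, result: str,
                         operands: List[str], rank: int) -> None:
    """Transfer function for an elementwise op: operand shapes must match
    at runtime, so each result dim gets the tightest bound known for
    that dim on any operand."""
    for d in range(rank):
        known = [bounds[(op, d)] for op in operands if (op, d) in bounds]
        if known:
            bounds[(result, d)] = min(known)

# %9 : tensor<10x?xf32> with dim 1 bounded by 20; %11 : tensor<?x?xf32>
# with dim 1 bounded by 20.
bounds: Bounds = {("%9", 0): 10, ("%9", 1): 20, ("%11", 1): 20}

# %10 = "foo.add"(%9, %11) -- propagate into the result.
transfer_elementwise(bounds, "%10", ["%9", "%11"], rank=2)
print(bounds[("%10", 1)])  # 20, i.e. tensor<? x [? <= 20] x f32>
```

Only after the analysis completes would one materialize the table, either into encoding attributes or into explicit bound ops, as discussed above.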

Of course this bound is one of a class of them. Bounds could be more general arithmetic combinations (and, if memory serves, you’d want at least lower and upper bounds captured for intermediates to cover all cases when computing an upper bound).
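A small sketch of why both bounds are needed for intermediates (illustrative interval arithmetic in Python, not MLIR’s actual ValueBounds machinery): the upper bound of a difference depends on the *lower* bound of what is subtracted, so tracking only upper bounds is not closed under arithmetic.

```python
class Interval:
    """A dynamic quantity known to lie in [lo, hi]."""
    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # The upper bound of a difference needs the rhs's *lower* bound.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # For possibly-negative bounds, check all corner products.
        corners = [a * b for a in (self.lo, self.hi)
                         for b in (other.lo, other.hi)]
        return Interval(min(corners), max(corners))

n = Interval(1, 20)   # e.g. a dynamic dim size in [1, 20]
m = Interval(4, 8)
print((n - m).hi)     # 16: computed from m.lo, not m.hi
print((n * m).hi)     # 160: worst-case element count of an n*m buffer
```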

There are tradeoffs here, but that’s what I like about a non-mutating Analysis followed by a separate application step: it lets you make this tradeoff. The cost is an extra iteration vs. mutating as you go. But you need not mutate at all if the Analysis can be retained until consumed, which is cheaper than creating new types/attributes.


Thanks a lot for such a detailed answer!

Maybe you can also suggest a possible solution for propagating this information through the bufferization pass? I don’t see a way to match tensors before bufferization with memref allocations after the bufferization pass.
I suspect the op approach can help here, by converting preconditions on tensors into preconditions on memrefs. But wouldn’t that prevent bufferization optimizations?