Is there a way to avoid creating instructions that are going to be CSE-ed anyway

I am hitting a pain point when using dynamically shaped tensor types and the handling of `tensor.dim` operations. Let's say I have

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%0 = "my_dialect.foo" ... : tensor<?x?xf32>
%1 = tensor.dim %0, %c0 : tensor<?x?xf32>
%2 = tensor.dim %0, %c1 : tensor<?x?xf32>
%3 = "my_dialect.bar" ... %0 ... : tensor<?x?xf32>
%4 = tensor.dim %3, %c0 : tensor<?x?xf32>
%5 = tensor.dim %3, %c1 : tensor<?x?xf32>

Let's say `my_dialect.foo` and `my_dialect.bar` implement the [`ReifyRankedShapedTypeOpInterface`](llvm-project/InferTypeOpInterface.td at fa56e362af475e0758cfb41c42f78db50da7235c · llvm/llvm-project · GitHub). Now, if I use the interface to resolve the `tensor.dim` ops, it will end up creating

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%0 = "my_dialect.foo" ... : tensor<?x?xf32>
%1 = tensor.dim %0, %c0 : tensor<?x?xf32>
%2 = tensor.dim %0, %c1 : tensor<?x?xf32>
%c0_1 = arith.constant 0 : index
%c1_2 = arith.constant 1 : index
%3 = "my_dialect.bar" ... : tensor<?x?xf32>
%4 = tensor.dim %0, %c0_1 : tensor<?x?xf32>
%5 = tensor.dim %0, %c1_2 : tensor<?x?xf32>

(assuming that the dimensions of the result of `my_dialect.bar` depend on the dimensions of `my_dialect.foo`). In the end it doesn't matter because it all gets CSE-ed, but this unnecessarily creates new instructions only for them to be destroyed later. The problem gets worse if there is a chain of operations that implement the `ReifyRankedShapedTypeOpInterface`.

One way I can think of addressing this is to extend the Listener here to also intercept notifications for create methods. Then a listener could keep track of all the `tensor.dim` operations in the IR and just return the value of an already existing `tensor.dim` instead of creating a new one. That would make the "Listener" more of a "Do-er", but it won't create new IR; it just avoids creating new instructions when they are not needed.
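To make the idea concrete, here is a toy, self-contained sketch of the deduplicating-creation pattern (this is not the actual MLIR `Listener` API; `DedupingBuilder`, `createOrReuse`, and the integer "result ids" are stand-ins invented for illustration): the builder keys each creation request by (op name, operands) and hands back the previously created result instead of making a duplicate that CSE would later erase.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an op-creating builder. Real MLIR would key on the
// OperationName, operands, and attributes; here an int stands in for a Value.
struct DedupingBuilder {
  using Key = std::pair<std::string, std::vector<int>>;
  std::map<Key, int> existing; // op key -> result id of the existing op
  int nextResult = 0;

  // Returns the result id for the requested op, reusing an identical
  // previously created op when one exists (the "listener as do-er" idea).
  int createOrReuse(const std::string &opName, const std::vector<int> &operands) {
    Key key{opName, operands};
    auto it = existing.find(key);
    if (it != existing.end())
      return it->second; // intercept: no new instruction is created
    int result = nextResult++;
    existing[key] = result;
    return result;
  }
};
```

With this shape, asking twice for `tensor.dim` of the same value and index yields the same result, so no redundant instruction exists for CSE to clean up later.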

My case is particularly related to `tensor.dim` operations. I am happy to try any solution (if the above listener-related change is not kosher and if it is within my capacity to do so).

cc: @nicolasvasilache @matthias-springer @mehdi_amini and @River707 for suggestions…

Do you actually need to reify the dimensions of %0 = "my_dialect.foo" and %3 = "my_dialect.bar" as SSA values? What is happening with the tensor.dim ops afterwards?

If you just need the tensor dimensions for further analyses and they don't need to be materialized in IR, you could use `ValueBoundsOpInterface::computeBound`, which produces an `AffineExpr` with respect to a specified set of values/tensor dims (⚙ D145681 [mlir][Interfaces] Add ValueBoundsOpInterface and tensor dialect op impl). There's also a helper function to materialize such a bound in IR (`mlir::reifyValueBound`).

E.g., if you have IR such as:

func.func @foo(%t: tensor<?xf32>) {
  %0 = "foo"(%t)
  %1 = "bar"(%0)
  %2 = "qux"(%1)
  // Reify the dimension of %2
}

ValueBoundsOpInterface can be used to generate tensor.dim %t, %c0 in one go. (No other ops are created/erased/…) But it would not solve the CSE’ing issue in case there are already existing tensor.dim %t, %c0 ops in the program.

If you try to do the same with the current ReifyRankedShapedTypeOpInterface, a bunch of tensor.dim ops are created, only to be erased again when “resolving” the tensor.dim.

`ValueBoundsOpInterface` might be useful… Has that all landed? I can give that a shot…

Also curious whether the results of the `ValueBoundsOpInterface` are cached/reused. If we end up traversing the whole IR every time to build the bounds, that's a huge compile-time overhead.

It should land soon. The implementation is finished and went through a first round of review. A few interface implementations may be missing for your use case, but these are easy to add.

I haven’t thought about this much until now, but this would be easy to implement. None of the implementation would have to be changed, we just need a public entry point that takes an existing constraint set (analysis state). Then it would not re-analyze/traverse IR that was already analyzed.
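A minimal sketch of what "a public entry point that takes an existing analysis state" buys you (the `BoundCache` type, the `traversals` counter, and the placeholder bound computation are all invented for illustration, not the interface's actual API): once a (value, dim) pair has been analyzed, repeat queries hit the cached constraint-set result instead of re-traversing the IR.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Toy model of a persistent analysis state shared across queries.
struct BoundCache {
  std::map<std::pair<int, int>, long> bounds; // (valueId, dim) -> bound
  int traversals = 0; // counts simulated IR traversals

  long computeBound(int valueId, int dim) {
    auto key = std::make_pair(valueId, dim);
    auto it = bounds.find(key);
    if (it != bounds.end())
      return it->second; // reuse previously computed constraints
    ++traversals; // stand-in for walking the defining ops
    long bound = static_cast<long>(valueId) * 100 + dim; // placeholder result
    bounds[key] = bound;
    return bound;
  }
};
```

The point is only the shape of the API: the state outlives a single query, so the cost of analysis is paid once per (value, dim) rather than once per call site.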

One issue is that changing the IR could invalidate the analysis. (Adding new IR is fine, but in-place modifications and removing ops is not safe.) That is why we don’t expose the underlying constraint set to the user at the moment. But if it’s well documented it should be fine.
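One way to make that safer, sketched below with invented names (`InvalidatingCache`, `notifyErased`; not an existing MLIR facility): hook IR-mutation notifications so that erasing an op drops any cached facts derived from it, rather than leaving stale entries behind.

```cpp
#include <cassert>
#include <map>

// Toy cache that is told about op erasure so it never answers a query
// from facts about IR that no longer exists.
struct InvalidatingCache {
  std::map<int, long> factsByOp; // opId -> cached analysis fact

  void record(int opId, long fact) { factsByOp[opId] = fact; }

  // Called from an IR-mutation notification (e.g., op erased).
  void notifyErased(int opId) { factsByOp.erase(opId); }

  bool hasFact(int opId) const { return factsByOp.count(opId) != 0; }
};
```

In-place modifications would need the same treatment (invalidate, then lazily recompute), which is exactly the discipline the SCEV experience below argues for.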

I think we have to be very careful with this. It reminds me of past major struggles with SCEV, where the IR was modified but the SCEV cache was not invalidated: the SCEV cache kept answering queries based on the cached, no-longer-valid information until cache invalidation inadvertently happened, the SCEV analysis was recomputed automatically, and we ended up with a mix of valid and invalid information during a single transformation.

I think we have to define the lifetime of the ValueBounds analysis and its proper usage. It's not clear to me at this point whether it should be implemented as an MLIR Analysis (Pass Infrastructure - MLIR) or something else. Something we can iterate on once we have the basic implementation in.