Buffer deallocation: missing start/finish abstraction handling

The buffer deallocation pass (-buffer-deallocation) currently relies only on liveness information to decide where to deallocate, i.e., on the SSA uses of the value (and its aliases). However, even within the MLIR tree, we have ops with start/finish abstractions, and the last use of an SSA value may not be the last use of the underlying resource! Deallocation points computed purely from liveness are then incorrect, and it looks like we are missing something in the design. In the example below, we can’t deallocate right after the last use of %host_buf in gpu.memcpy: we’d have to find the matching gpu.wait (or better, the next use of its async token result) and place the dealloc after that. As it stands, buffer-deallocation would generate incorrect IR here.

%host_buf = memref.alloc
// Last SSA use of %host_buf; the copy it starts may still be in flight afterwards.
... = gpu.memcpy %host_buf, %device_buf

The bufferization passes shouldn’t/wouldn’t know anything about the gpu dialect. There is the AllocationOpInterface, which provides a buildDealloc hook and could also determine where to place the dealloc for a given alloc, but even if memref.alloc implemented this, it can’t be expected to know about the GPU dialect. It looks like we either need a new interface, StartFinishOpInterface, that provides a method to get the matching wait op (when it can), or should such a method be added to AsyncOpInterface instead? Either way, the interface could then be used from the buffer-deallocation pass.
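
To make the proposal a bit more concrete, here is a rough, standalone C++ sketch of the placement logic. It does not use any MLIR API: Op is a toy stand-in for an operation, and startsAsyncWork, getMatchingFinish and asyncAwareDeallocPoint are hypothetical names invented for illustration. It also assumes a single block, ignores aliases, and treats the first consumer of the token as the finish op.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for an operation: SSA operands/results by name only.
struct Op {
  std::string name;
  std::vector<std::string> operands;
  std::vector<std::string> results;
  // Hypothetical StartFinishOpInterface bit: the op only *starts* work on
  // its operands; its first result is a token consumed by the finish op.
  bool startsAsyncWork = false;
};

static bool usesValue(const Op &op, const std::string &value) {
  for (const std::string &operand : op.operands)
    if (operand == value)
      return true;
  return false;
}

// Liveness-only placement: index of the last op using `value`, if any.
std::optional<std::size_t> lastUse(const std::vector<Op> &block,
                                   const std::string &value) {
  std::optional<std::size_t> last;
  for (std::size_t i = 0; i < block.size(); ++i)
    if (usesValue(block[i], value))
      last = i;
  return last;
}

// Hypothetical interface method: index of the first later op that consumes
// the token produced by block[start], or nullopt when it cannot be found.
std::optional<std::size_t> getMatchingFinish(const std::vector<Op> &block,
                                             std::size_t start) {
  const Op &op = block[start];
  if (!op.startsAsyncWork || op.results.empty())
    return std::nullopt;
  for (std::size_t i = start + 1; i < block.size(); ++i)
    if (usesValue(block[i], op.results.front()))
      return i;
  return std::nullopt;
}

// Start/finish-aware placement: each use that starts async work is extended
// to its matching finish. Returns nullopt when the buffer has no uses, or
// when a start op has no finish in this block (no safe point is known).
std::optional<std::size_t>
asyncAwareDeallocPoint(const std::vector<Op> &block,
                       const std::string &buffer) {
  assert(!buffer.empty() && "expected an SSA value name");
  std::optional<std::size_t> point;
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (!usesValue(block[i], buffer))
      continue;
    std::size_t end = i;
    if (block[i].startsAsyncWork) {
      std::optional<std::size_t> finish = getMatchingFinish(block, i);
      if (!finish)
        return std::nullopt; // Be conservative: do not place a dealloc.
      end = *finish;
    }
    if (!point || end > *point)
      point = end;
  }
  return point;
}
```

For a block modelled as memref.alloc, gpu.memcpy (marked as starting async work and producing a token), and a later gpu.wait on that token, lastUse points at the gpu.memcpy, while asyncAwareDeallocPoint points at the gpu.wait. The pass itself only talks to the interface, so it still would not need to know anything about the gpu dialect.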

CC: @dfki-mako @pifon2a