Handling of string tensors

Hello, both TF and ONNX have tensors of strings, and I see that some support has been added for the TF string tensor type. Is there a way to get some basic string-tensor functionality into MLIR? From a cursory investigation of both ONNX and TF, very few operations create or modify string tensors. The large majority of operations that accept string tensors are of the concat/reshape/... kind.

At the lower levels of the stack, there is a pretty big difference between tensors of value/bag-of-bits types and tensors of data structures (which are typically reference types of some form under the hood). As you mention, there are very few ways in or out of string-tensor world, and the operations it supports tend to be either library-call based (e.g. slice_prefixes, encode_utf16) or mirrors of the usual copy/slice/gather/concat operations.
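To make the value-type versus data-structure distinction concrete, here is a minimal Python sketch. The helper names (`pack_floats`, `pack_strings`, `get_string`) and the offsets-plus-bytes layout are assumptions chosen for illustration; this is one possible encoding, not how TF, ONNX, or IREE actually store string tensors.

```python
import struct


def pack_floats(values):
    # Fixed-width elements: element i always lives at byte offset 4 * i,
    # so the whole tensor is a plain "bag of bits".
    return struct.pack(f"<{len(values)}f", *values)


def pack_strings(values):
    # Variable-length elements: a side table of offsets is needed to find
    # each element, so the tensor is really a small data structure.
    encoded = [v.encode("utf-8") for v in values]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    return offsets, b"".join(encoded)


def get_string(offsets, data, index):
    # Element lookup goes through the offset table, not simple arithmetic.
    return data[offsets[index]:offsets[index + 1]].decode("utf-8")


float_bytes = pack_floats([1.0, 2.0, 3.0])
string_offsets, string_bytes = pack_strings(["a", "héllo", ""])
```

The float buffer can be sliced or reshaped by index arithmetic alone, while every string operation has to consult (and usually rebuild) the offset table, which is why such ops tend to end up as library calls.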

In IREE, we decided with our Strings module to take advantage of the fact that this is already a well-defined type island which merely needs to interop with the numeric-tensor domain and made it a separate type/op hierarchy.

TensorFlow has dialect types that can stand in for a string element type of a tensor (also for variants, but that is another topic), and we treat this as specific to TensorFlow, legalizing into the more dedicated type/op hierarchy that IREE defines (we do the same thing for other quirky, library-backed type islands like TensorLists). We then treat this support as optional and can drop it entirely (i.e. for targets destined for embedded devices and such).
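As a rough illustration of that legalization step, here is a toy Python model. It is a sketch only: the `Op` class, the type spellings, the op names in `OP_MAP`, and the `strings_supported` flag are all hypothetical stand-ins, not the actual TensorFlow or IREE dialect definitions or pass APIs.

```python
from dataclasses import dataclass

# Hypothetical type spellings, used only for this sketch.
FRONTEND_STRING = "tensor<?x!tf.string>"
BACKEND_STRING = "!strings.string_tensor"


@dataclass(frozen=True)
class Op:
    name: str
    operand_types: tuple
    result_type: str


# Hypothetical mapping from frontend ops to ops in the strings island.
OP_MAP = {
    "tf.ConcatV2": "strings.concat",
    "tf.Reshape": "strings.reshape",
    "tf.GatherV2": "strings.gather",
}


def convert_type(type_name):
    # Only the frontend string tensor type is rewritten; numeric types pass through.
    return BACKEND_STRING if type_name == FRONTEND_STRING else type_name


def legalize(ops, strings_supported=True):
    legalized = []
    for op in ops:
        if FRONTEND_STRING not in op.operand_types + (op.result_type,):
            legalized.append(op)  # numeric ops are left untouched
            continue
        if not strings_supported:
            # Targets that drop the optional strings support reject such programs.
            raise ValueError(f"{op.name}: string tensors are not supported on this target")
        if op.name not in OP_MAP:
            raise ValueError(f"{op.name}: no legalization into the strings island")
        legalized.append(Op(
            OP_MAP[op.name],
            tuple(convert_type(t) for t in op.operand_types),
            convert_type(op.result_type),
        ))
    return legalized


program = [
    Op("tf.AddV2", ("tensor<4xf32>", "tensor<4xf32>"), "tensor<4xf32>"),
    Op("tf.ConcatV2", (FRONTEND_STRING, FRONTEND_STRING), FRONTEND_STRING),
]
lowered = legalize(program)
```

In this toy model the numeric op passes through unchanged while the string op moves into its own type/op island, and calling `legalize(program, strings_supported=False)` fails early, mirroring the idea that string support is an optional component.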

Both TensorFlow and ONNX are going to need a representation for their types at the top level, and I think it is fine for those to be dialect specific. Until we can see more of how this connects, that keeps us from ingesting the systemic cost of things that, once inspected, are probably subtly different anyway and fall into the category of emulating what a source system defined rather than designing, say, an MLIR-centric modeling of strings, sequences, etc. I'd rather see whether a couple of the examples that emerge reveal any patterns, and then pull that together, than expand a core type hierarchy like tensor first.

Also, NumPy, in full generality, supports general Python objects as dtypes. If we ever support that, I definitely want to see it sequestered appropriately. Pulling this thread and elevating source system types into the core can spiral pretty quickly, and I think leveraging dialect-specific types on the frontend that legalize into type/op islands on the backend is the way to go.

This may put some more pressure on making it easier to create/transform such dialect type systems, and getting that right is much more in line with what MLIR itself is positioned to do.
