Compiling a TFLite Model to Tosa-MLIR dialect

Hi, I want to compile an FP16-quantized TFLite model to the TOSA dialect in MLIR and have a few questions. The thread Missing translations when using tf-opt to translate to tosa dialect includes tf-mlir-translate and tf-opt commands that can be used to compile a TensorFlow model in protobuf format to MLIR, but I am unclear how to translate from a .tflite flatbuffer to TOSA MLIR. One possible (but messy) sequence seems to be using tf-opt --tfl-legalize-tf to convert from TF MLIR (after tf-mlir-translate) to TFL MLIR (the TFLite dialect), and then applying tf-opt --enable-float16-quantization to quantize within the TFLite dialect. Is that the expected sequence of passes, or is there a more direct/preferable way of compiling TFLite models to TOSA? Pointers are much appreciated!

Hi Hashim,
If you’d like to go from tflite to TOSA-MLIR, you’ll want to use the flatbuffer_translate command combined with the tf-opt command like this:

flatbuffer_translate --tflite-flatbuffer-to-mlir <input tflite file> | tf-opt --tfl-to-tosa-pipeline

(flatbuffer_translate is in tensorflow/compiler/mlir/lite)

Having said that, we haven’t done testing with float16 networks to make sure that the TFLite-to-TOSA pipeline works in those cases. We’ve focused on float32 and int8 as our primary data formats. The TOSA spec calls out float16 (and bfloat16), but the compilation pipeline is a work in progress.

Thanks @eric-k! flatbuffer_translate --tflite-flatbuffer-to-mlir <input tflite file> | tf-opt --tfl-to-tosa-pipeline worked perfectly to generate the float32 TOSA output.

Thanks for the update on the float16 compilation pipeline. Alternatively, is there some way to quantize the weights while in TOSA (or lower) dialects? I see that tf-opt supports a few quantization options, e.g., tfl-post-quantize and enable-float16-quantization, but using these still generates float32 TOSA. Any suggestions on how to generate float16 TOSA code? Would one have to write some custom transformation?

For INT8 quantization, can it be done in the TFL dialect (tfl-post-quantize doesn’t seem to work) or would that have to happen in the TensorFlow source code?

Interestingly, I find that when you pass a float16-quantized .tflite module as input to flatbuffer_translate, the TOSA constants (holding the weights) are in fact FP16 tensors, but before being passed as inputs to tosa.conv2d (and similarly other tensor ops) they are explicitly cast to FP32 using tosa.cast operations. Does anyone know why flatbuffer_translate adds these casts?
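To illustrate, the pattern I’m seeing reduces to the sketch below. The MLIR excerpt is hand-written to show the shape of the output (made-up tensor shapes, not copied from my model), and a few lines of Python can flag each cast:

```python
import re

# Illustrative (hand-written) excerpt of the kind of TOSA MLIR I'm seeing:
# an f16 weight constant cast up to f32 before being fed to the conv.
mlir_text = '''
%0 = "tosa.const"() {value = dense<0.0> : tensor<16x3x3x3xf16>} : () -> tensor<16x3x3x3xf16>
%1 = "tosa.cast"(%0) : (tensor<16x3x3x3xf16>) -> tensor<16x3x3x3xf32>
'''

# Report every tosa.cast together with its source and destination tensor types.
casts = re.findall(
    r'"tosa\.cast"\(%\d+\) : \(tensor<([^>]+)>\) -> tensor<([^>]+)>', mlir_text)
for src, dst in casts:
    print(f"tosa.cast: {src} -> {dst}")  # prints: tosa.cast: 16x3x3x3xf16 -> 16x3x3x3xf32
```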

For float16, the TOSA dialect does specify that F16 is a valid format for tensor data. In the TensorFlow dialect, I only see a few ops (fill, dequantize, NumericVerify) that list F16 support.
I haven’t looked at the output of a run with enable-float16-quantization to see what it produces. We would need to make sure that the legalization passes properly handle F16 tensors. Since that hasn’t been tested, I’m sure additional legalization code will need to be added.

It’s theoretically possible to quantize in the TOSA dialect, but we don’t have a pass to add quantization information there. You currently need to quantize in TensorFlow itself (Post-training quantization | TensorFlow Lite), and generate a quantized tflite file. From there, the tflite-to-TOSA pipeline should create a working TOSA int8 graph.
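For reference, a minimal sketch of the full-integer recipe from that guide, with a toy one-matmul model and random calibration data standing in for a real network and representative inputs:

```python
import numpy as np
import tensorflow as tf

# Toy model (placeholder for a real network): a single matmul.
class Tiny(tf.Module):
    def __init__(self):
        self.w = tf.Variable(tf.random.normal([8, 4]))

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

def representative_dataset():
    # Calibration samples from which the converter derives int8 scales.
    for _ in range(10):
        yield [np.random.rand(1, 8).astype(np.float32)]

m = Tiny()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [m.__call__.get_concrete_function()], m)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # int8 inputs/outputs, as in the
converter.inference_output_type = tf.int8   # "full integer" recipe
tflite_model = converter.convert()          # flatbuffer bytes; write to a .tflite file
```

The resulting .tflite bytes are what you’d then feed through flatbuffer_translate and tf-opt --tfl-to-tosa-pipeline.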

I successfully followed the tutorial here (same as the one you point to), Post-training quantization | TensorFlow Lite, to generate an INT8 TFLite module using ‘full integer quantization’ (int8 inputs and outputs). Running flatbuffer_translate and tf-opt on this module generated TOSA MLIR that uses INT8 types. Interestingly, the INT8 quantized types also include the scale and zero point for the corresponding tensor. This info would come in handy for hardware-specific lowering passes.
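For example, those parameters can be pulled out of the printed type with a little Python. The type string below is illustrative (made-up scale and zero point), following the !quant.uniform<storage:expressed, scale:zero_point> syntax:

```python
import re

# Illustrative quantized tensor type of the form flatbuffer_translate emits;
# the scale/zero-point values here are made up, not from a real model.
qtype = "tensor<1x16x!quant.uniform<i8:f32, 0.0235:-128>>"

# Capture: storage type, expressed type, scale, zero point.
m = re.search(r"!quant\.uniform<(\w+):(\w+), ([0-9.eE+-]+):(-?\d+)>", qtype)
storage, expressed = m.group(1), m.group(2)
scale, zero_point = float(m.group(3)), int(m.group(4))
print(storage, expressed, scale, zero_point)  # prints: i8 f32 0.0235 -128
```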

On FP16, I see your point. It does look like some support is missing - the --enable-float16-quantization option doesn’t help either. In the generated TOSA, I believe the tosa.cast operations are added to upcast the weights to match the FP32 inputs - FP16 quantization in TF only seems to support FP32 inputs and outputs. I am wondering if flatbuffer_translate could be modified with small changes to accept FP16 inputs. Alternatively, one may be able to write a TOSA pass that legalizes FP32 tensors to FP16.
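For anyone following along, the float16 recipe from the post-training quantization guide looks roughly like this (toy one-matmul model as a placeholder for a real network). Note that the converted model’s inputs and outputs do indeed stay float32, consistent with the tosa.cast upcasts above:

```python
import tensorflow as tf

# Toy model (placeholder for a real network): a single matmul.
class Tiny(tf.Module):
    def __init__(self):
        self.w = tf.Variable(tf.random.normal([8, 4]))

    @tf.function(input_signature=[tf.TensorSpec([1, 8], tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

m = Tiny()
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [m.__call__.get_concrete_function()], m)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # weights stored as f16
tflite_model = converter.convert()  # model I/O remains float32
```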