Opening this thread to discuss technical details around the development of a path to perform TOSA functional verification through the MLIR EmitC dialect.
This was briefly discussed on Discord, starting with a request from @marbre, but bringing it here for broader visibility.
Present
The TOSA reference model (reference_model.git) consumes the TOSA flatbuffers form. It takes the model and network inputs and emits functional network output. It is aligned to the TOSA specification.
Proposal
Drive the TOSA reference model by generating C API calls to the reference model from the TOSA MLIR form using the EmitC dialect. The reference model would then be a library rather than a binary.
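To make the proposal concrete, here is a minimal sketch of what such EmitC-generated driver code could look like. The tosa_ref_* names and the handle/tensor-id scheme are stand-ins invented here for illustration; the actual C API for a library-form reference model is still to be defined.

```cpp
// Hypothetical sketch of EmitC-generated calls into a library-form
// reference model. The tosa_ref_* API below is a stand-in defined here
// for illustration only.
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a graph handle owned by the reference-model library.
struct tosa_ref_graph {
  std::vector<std::vector<float>> tensors;
};

inline tosa_ref_graph *tosa_ref_graph_create() { return new tosa_ref_graph(); }
inline void tosa_ref_graph_destroy(tosa_ref_graph *g) { delete g; }

// Register a network input; returns a tensor id.
inline int tosa_ref_add_input(tosa_ref_graph *g, const float *data,
                              std::size_t n) {
  g->tensors.emplace_back(data, data + n);
  return static_cast<int>(g->tensors.size()) - 1;
}

// Stand-in for one op node (elementwise add, mirroring tosa.add semantics).
inline int tosa_ref_op_add(tosa_ref_graph *g, int lhs, int rhs) {
  const std::vector<float> &a = g->tensors[lhs];
  const std::vector<float> &b = g->tensors[rhs];
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] + b[i];
  g->tensors.push_back(std::move(out));
  return static_cast<int>(g->tensors.size()) - 1;
}

// Read back a functional result.
inline const float *tosa_ref_get_output(const tosa_ref_graph *g, int id) {
  return g->tensors[id].data();
}
```

The point is only the shape of the flow: EmitC would emit a create call, one call per input, one call per op in topological order, and a result read-back, with the library owning graph state behind an opaque handle.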
Conversation So Far
On our part, when we implemented the reference model, we considered offering a mechanism like this, but there was no immediate use case then. Adding @jsmolens and @eric-k here as additional involved folks at Arm.
The reference model isn't a performance-focused runtime - the focus is on precision and bit accuracy for comparison against frontend reference output, and it serves as a critical part of HW/SW co-design efforts.
There are some design considerations around this proposal to consider, e.g.
Passing parameter and datatype information in a manner easily parseable on the reference model side.
How to convey the optional quantization information construction properly to the reference model interface? A cleanup of the dialect interface is planned for the next TOSA minor version update (v0.24) that should significantly simplify this, but this update is only intended to happen in January.
If the community thinks this is a useful thing, we are open to upstreaming what we have. Like the TOSA reference model, our header-only implementation is not targeting performance. However, we are currently evaluating the option of using Eigen (the implementation currently only depends on the standard library). Side note: we also have an MHLO to EmitC pass that relies on our header-only implementations.
In addition, we would like to discuss whether it would be useful for the community if we refactored so that the TOSA reference model is used instead of our own header-only implementation. This would also require a more library-friendly version of the TOSA reference model. Therefore, I would love to hear the opinion of the community first.
That's quite a lot of pieces you already have in place to interface with the TOSA reference model! We're pleased to see you've played with the reference model (which uses Eigen) too.
There are some mechanical questions here around how a library form of the reference model would interface into the MLIR "runtime". Do you already have a proposal for how that would be driven, @marbre? Right now, the reference model has a graph traverser along with the Eigen-based functional implementation of the ops themselves.
So far, I only have some initial thoughts but not yet a concrete proposal.
The main issue is indeed the tight coupling between the serializer, the graph traverser and the ops. From what we have so far with TOSA → EmitC → EmitC C++ reference implementation (by EmitC C++ reference implementation I refer to the emitc_*.h files in https://github.com/iml130/mlir-emitc/blob/main/include/emitc/), it would be straightforward for us to replace the EmitC C++ reference implementation with the Eigen-based ops implemented in the TOSA reference model.
However, this would require decoupling the op implementations and making them available as a library.
Ops in the TOSA reference model are stateful. It is probably necessary to separate state handling from the computation itself. My colleague @david_ronnenberg has looked into it and could give a more detailed description of what would be needed.
TOSA supports some datatypes for which we don't have any support on the EmitC side so far.
The emitted tensors target emitc::Tensor. We would need an efficient conversion to TOSA/Eigen tensors, or an option to directly emit those via the C++ emitter.
So there are definitely mechanical questions that need to be solved. We're willing to push this forward but, as mentioned before, would like to hear whether this is of interest to the community.
@david_ronnenberg Please feel free to comment, especially if I have missed something that we have already discussed internally.
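On the statefulness point above, one possible shape for the separation is to keep the computation free of graph bookkeeping and pass state in explicitly, so that the caller (e.g. a traverser) owns persistence. This is a sketch only; the names are illustrative and not the reference model's actual classes.

```cpp
// Sketch of separating op state from the computation itself, as
// discussed above. Illustrative names, not reference-model code.
#include <cassert>
#include <cstddef>
#include <vector>

// State kept across invocations (e.g. a variable tensor for an RNN cell).
struct VariableState {
  std::vector<float> value;
};

// The computation only reads its inputs and the explicitly passed state;
// the caller decides where the state lives and when it is persisted.
inline std::vector<float> stepAccumulate(const std::vector<float> &input,
                                         VariableState &state) {
  if (state.value.empty())
    state.value.assign(input.size(), 0.0f);
  for (std::size_t i = 0; i < input.size(); ++i)
    state.value[i] += input[i];
  return state.value;
}
```

With this split, a library client can hold VariableState objects per model instance while the op body itself stays a plain function.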
A quick heads-up related to this - we have implemented the Statefulness Support for TOSA proposal (#9 by sjarus on the TOSA Discourse) as a prototype for the purpose of expressing RNNs; it comprises the TOSA dialect utility ops, the serialization lib and reference model updates, and it works. We'll be releasing it in the near future.
This will add some complexity around a few things, e.g. a simple memory model for maintaining persistent state contents, and interfacing those utility ops while still in MLIR, whereas the serialized form translates these into tensors with a special is_variable bit set.
I think that makes sense out of tree. I'd rather not have a dependency on Eigen in core (too many moving parts, and I don't think C++ standard support lines up). In the MHLO repo that could be fine given TF's dependency on Eigen. Utilizing a pure BLAS (or some similarly standard) interface and being able to switch in different implementations would be more appealing.
Having two versions also makes sense, so you have one for correctness and the other as a baseline for comparing against codegen, or for where you only have a C/C++ compiler available as backend. But Eigen is a large dependency, and I'd much rather have a pure reference implementation + codegen in core than an optimized library implementation with more dependencies, which would introduce a support & maintenance requirement.
Thanks a lot for your feedback. A pure reference implementation is what we initially had in mind, and our header-only implementation therefore has no dependency other than the standard library. Hence, I can of course think of upstreaming the reference implementation we have so far to the MLIR core. To us, the main question really is whether there is interest in adding such a reference implementation to the core (or to some other repo).
I therefore agree that an Eigen-based implementation makes more sense as part of an out-of-tree project. Basically, I am trying to figure out which implementation might be interesting for which user base.
We'll definitely take a look into the proposal. When you say you'll be releasing it in the near future, is a date already scheduled? Referring to @jpienaar's comment, what location would you like to see/suggest for a reference implementation that lowers TOSA to EmitC to something tbd?
When we implemented the TOSA reference model and the MLIR pass to serialize TOSA form to drive the reference model, we faced similar issues with both the Eigen and Flatbuffers dependencies, which we could not see how to fit within the MLIR core. Ultimately we left them as originally done - the reference model as a standalone flatbuffers/JSON-consuming binary, and the MLIR pass in a standalone repo that could easily be linked with an MLIR pass manager.
To get a better understanding of the requirements, what does the EmitC path intend? What would the networks being emitted be - single-op unit test cases, for example, or full networks in TOSA MLIR?
To run full networks, any parallel reference implementation would need to implement graph building and traversal, the basic memory model concepts, file I/O… which all amount to duplicating the existing reference model functionality.
Conceptually it seems more straightforward to have EmitC have a functional verification mode where it emits calls to construct a full MLIR TOSA form, invoke the graph builder in the existing reference model, invoke its traverser and get an actual functional bit level output.
This would be an out-of-tree path since the dependency on the external reference model and its own Eigen dependency would remain; the flatbuffers dependency would be absent since this is an independent path to construct and drive the reference model.
Having discussed this internally with @jsmolens and @eric-k we think this is feasible, though the interfaces would need to be defined.
No, we're still trying to close out internal and external feedback loops on this, as is normal with new proposals on the TOSA discourse.
I'm surprised: I'm pretty sure I mentioned a path for this back then (basically a CMake flag to conditionally enable this); you shouldn't hesitate to bring up this kind of question as soon as it occurs.
Ultimately, I believe it'll be incredibly valuable to build as much as possible of the end-to-end flow upstream.
I've been meaning to invest more into this for a while, and we'll get there. The duplication may be unfortunate, but what's the alternative? Would you be willing to upstream your work instead?
Thanks for reminding me of this - I should have mentioned we had this option. At that point in time we considered it simpler to first get the pieces out, get traction and make the case for upstreaming.
Right now everything needed for TOSA functional validation is open source: the dialect, frontend legalizations (TF done, Torch in development), the reference model, and the MLIR pass to emit flatbuffers to the reference model.
Probably the easiest starting point here is to move the last-mentioned piece into core as a conditional build rule, since it has a flatbuffers dependency. Could you point us to an existing core MLIR conditional build CMake construct?
Weāre very interested in upstreaming more pieces of TOSA infrastructure due to the interest level around it.
For example, we'd like to upstream the legalization unit test infrastructure to the frameworks. This would catch breakages in legalizations - valuable for e2e flows depending on TOSA.
We already have positive interest from @_sean_silva to go ahead with doing this for Torch-MLIR. We'd like to make the case, based on that proof of concept, to upstream the unit test infrastructure for the TF/TFLite->TOSA legalizations to that repo too - something that would be run as at least an optional CI test. But perhaps this isn't the right venue for that, and if so, is there another place we could propose it?
The MLIR_ENABLE_BINDINGS_PYTHON option: this is maybe the closest one since it also relies on externally available optional dependencies; see the online doc (we could have a similar page for TOSA stuff).
The MLIR_ENABLE_CUDA_RUNNER option: requires CUDA available for building and a GPU for running the tests (same with the options MLIR_ENABLE_ROCM_RUNNER, MLIR_ENABLE_SPIRV_CPU_RUNNER and MLIR_ENABLE_VULKAN_RUNNER).
Happy to help the integration as you need!
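Following the pattern of the options above, a conditional build rule for the flatbuffers-dependent piece might look roughly like this. This is a sketch only; the option name MLIR_ENABLE_TOSA_REF_MODEL and the subdirectory are hypothetical, not existing MLIR CMake constructs.

```cmake
# Hypothetical sketch in the spirit of MLIR_ENABLE_CUDA_RUNNER: guard the
# flatbuffers-dependent TOSA serialization pass behind an opt-in flag.
option(MLIR_ENABLE_TOSA_REF_MODEL
       "Build the TOSA reference-model serialization pass" OFF)

if(MLIR_ENABLE_TOSA_REF_MODEL)
  # The external dependency is only required when the option is enabled.
  find_package(FlatBuffers REQUIRED)
  add_subdirectory(TosaSerialization)  # placeholder directory name
endif()
```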
I'd be really interested to get to the point where we have all the pieces in-tree, including a test harness to actually run a model, with some minimalist runtime support as needed.
I would still happily work on upstreaming the conversion from TOSA to EmitC as well as the reference implementation. There are a few things to consider:
The dimensions of tensors are encoded as template arguments, which is not the most convenient approach when adding new operations to the reference implementation. Alternatives could be:
Explicitly pass tensor dimensions via arguments to the constructor.
Model the tensor dimensions similar to mdspan.
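The first alternative above could be sketched as follows: a row-major tensor that takes its shape at runtime through the constructor, backed by std::vector like the EmitC reference implementation. The class name and interface are illustrative only.

```cpp
// Sketch of the "dimensions via constructor" alternative: shape is a
// runtime value rather than template arguments. Illustrative only.
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

template <typename T>
class RuntimeTensor {
public:
  // Shape passed at runtime; storage is a flat row-major buffer.
  explicit RuntimeTensor(std::vector<std::size_t> shape)
      : shape_(std::move(shape)),
        data_(std::accumulate(shape_.begin(), shape_.end(), std::size_t{1},
                              std::multiplies<std::size_t>())) {}

  // Row-major linearized element access.
  T &at(const std::vector<std::size_t> &idx) {
    std::size_t flat = 0;
    for (std::size_t d = 0; d < shape_.size(); ++d)
      flat = flat * shape_[d] + idx[d];
    return data_[flat];
  }

  std::size_t rank() const { return shape_.size(); }
  std::size_t size() const { return data_.size(); }

private:
  std::vector<std::size_t> shape_;  // declared before data_: init order matters
  std::vector<T> data_;
};
```

The trade-off versus template-encoded dimensions is losing compile-time shape checking and some optimization opportunity, in exchange for ops that no longer need a template instantiation per shape.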
At the moment, we test every TOSA to EmitC conversion with lit tests and have a unit test for every operation in the reference implementation. However, we do not test the interaction between the generated code and the reference implementation on a per-op basis. Within our repo we have an integration test which translates a MobileNetV2 from Keras and runs the compiled executable, but this does not cover every (supported) op. For upstream inclusion, we probably want a test suite that converts a TOSA op to EmitC, generates C++ via the emitter, and compiles and executes this as a test case (?).
One thing to note: the reference implementation is based on std::vector. Thus, one cannot access the data pointer for boolean tensors, since std::vector<bool> is a bit-packed specialization that does not provide data().
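The limitation and one common workaround can be shown in a few lines: back the boolean tensor with bytes instead of std::vector<bool>, so a stable raw pointer exists. BoolTensor below is a hypothetical illustration, not part of the existing reference implementation.

```cpp
// std::vector<bool> is a bit-packed specialization with no data() member
// returning bool*, so it cannot hand out a raw buffer pointer. A common
// workaround is to store booleans as bytes. Sketch only.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BoolTensor {
  std::vector<std::uint8_t> storage;  // one byte per element, zero-initialized

  explicit BoolTensor(std::size_t n) : storage(n, 0) {}

  // A real, stable data pointer - unlike std::vector<bool>.
  std::uint8_t *data() { return storage.data(); }

  bool get(std::size_t i) const { return storage[i] != 0; }
  void set(std::size_t i, bool v) { storage[i] = v ? 1 : 0; }
};
```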
We've been using a std::vector<size_t>, would that work? (fwiw, in the tf_compile use case, which EmitC would replace, we don't even use that; we just bind by name and let testing catch invalid memory accesses.)
Is the concern that we'd have an extra test burden on anything that's not primarily interested in consuming EmitC? If so, we can use the ml-opt-* build bots to exercise all this - it'd both do integration testing like it does today - taking "some" model → tf_compile → use it - and it could do the more rigorous suite you're mentioning.
Probably ok, so far we've modeled bools as int64_t anyway. Not saying it's great, but this doesn't sound like it'd break anything.
While here, on tensor assumptions, we basically have 2 main ones:
we can bind by name
we can use stable buffers that are in row-major order. By "stable" I mean that once the EmitC object representing a model instance is handed a buffer, it doesn't free it until the object is deleted. We're OK with handing those buffers over ourselves, fwiw (the benefit of letting the model hand them out is that it may have preferences about where to place them)
This lets us pay a higher upfront cost at binding time, and then we just do simple stores into a buffer (no virtualization or anything) + eval (we can chat about this more)
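The two assumptions above could be sketched as a minimal interface: bind caller-owned, row-major buffers by name once, then repeatedly store into them and call eval. ModelInstance and its members are hypothetical names for illustration, not an existing EmitC interface.

```cpp
// Sketch of bind-by-name plus stable buffers: the instance records the
// caller-owned pointers and keeps using them until it is destroyed,
// never freeing them itself. Illustrative only.
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct ModelInstance {
  // Caller-owned, row-major buffers bound by name.
  std::map<std::string, float *> bindings;

  void bind(const std::string &name, float *buffer) {
    bindings[name] = buffer;
  }

  // Stand-in eval: out = in * 2, reading/writing through bound buffers.
  void eval(std::size_t n) {
    float *in = bindings.at("input");
    float *out = bindings.at("output");
    for (std::size_t i = 0; i < n; ++i)
      out[i] = in[i] * 2.0f;
  }
};
```

Because the pointers are stable, each inference is just plain stores into the input buffer followed by eval, with the name lookup cost paid once at binding time.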
Yes, the use of std::vector<size_t> is one possible option to specify the shape in the constructor. I think we will evaluate the different options.
Well, I am not confident that we want to introduce a dependency on TensorFlow into the test pipeline. But we probably also don't want to include a model stored in MLIR. So what I have in mind is rather a pipeline that converts every single supported TOSA op to EmitC. The EmitC op is then translated to C++, compiled and executed.
The current implementation is row-major. However, we don't support bind-by-name (AFAIR this was in one of the MHLO-related PRs), and the C++ reference implementation doesn't provide statefulness support for TOSA ops.