Opening this thread to discuss technical details around the development of a path to perform TOSA functional verification through the MLIR EmitC dialect.
This was briefly discussed on Discord, starting with a request from @marbre, but bringing it here for broader visibility.
Present
The TOSA reference model (reference_model.git) consumes the TOSA flatbuffers form. It takes the model and network inputs and emits functional network output. It is aligned to the TOSA specification.
Proposal
Drive the TOSA reference model by generating C API calls to the reference model from the TOSA MLIR form using the EmitC dialect. The reference model would then be a library rather than a binary.
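To make the proposal concrete, here is a minimal sketch of what such EmitC-generated driver code could look like. The tosa_ref_* names and the handle/tensor-id scheme are stand-ins invented here for illustration; the actual C API for a library-form reference model is still to be defined.

```cpp
// Hypothetical sketch of EmitC-generated calls into a library-form
// reference model. The tosa_ref_* API below is a stand-in defined here
// for illustration only.
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a graph handle owned by the reference-model library.
struct tosa_ref_graph {
  std::vector<std::vector<float>> tensors;
};

inline tosa_ref_graph *tosa_ref_graph_create() { return new tosa_ref_graph(); }
inline void tosa_ref_graph_destroy(tosa_ref_graph *g) { delete g; }

// Register a network input; returns a tensor id.
inline int tosa_ref_add_input(tosa_ref_graph *g, const float *data,
                              std::size_t n) {
  g->tensors.emplace_back(data, data + n);
  return static_cast<int>(g->tensors.size()) - 1;
}

// Stand-in for one op node (elementwise add, mirroring tosa.add semantics).
inline int tosa_ref_op_add(tosa_ref_graph *g, int lhs, int rhs) {
  const std::vector<float> &a = g->tensors[lhs];
  const std::vector<float> &b = g->tensors[rhs];
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] + b[i];
  g->tensors.push_back(std::move(out));
  return static_cast<int>(g->tensors.size()) - 1;
}

// Read back a functional result.
inline const float *tosa_ref_get_output(const tosa_ref_graph *g, int id) {
  return g->tensors[id].data();
}
```

The point is only the shape of the flow: EmitC would emit a create call, one call per input, one call per op in topological order, and a result read-back, with the library owning graph state behind an opaque handle.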
Conversation So Far
On our part, when we implemented the reference model, we considered offering a mechanism like this, but there was no immediate use case then. Adding @jsmolens and @eric-k here as additional involved folks at Arm.
The reference model isn't a performance-focused runtime - the focus is on precision and bit accuracy for comparison against frontend reference output, and it serves as a critical part of HW/SW co-design efforts.
There are some design considerations around this proposal to consider, e.g.
Passing parameter and datatype information in a manner easily parseable on the reference model side.
How to convey the optional quantization information construction properly to the reference model interface? A cleanup of the dialect interface is planned for the next TOSA minor version update (v0.24) that should significantly simplify this, but this update is only intended to happen in January.
If the community thinks this is a useful thing, we are open to upstreaming what we have. Like the TOSA reference model, our header-only implementation is not targeting performance. However, we are currently evaluating the option of using Eigen (the implementation currently only depends on the standard library). Side note: we also have an MHLO to EmitC pass that relies on our header-only implementations.
In addition, we would like to discuss whether it would be useful for the community if we refactored so that the TOSA reference model is used instead of our own header-only implementation. This would also require a more library-friendly version of the TOSA reference model. Therefore, I would love to hear the opinion of the community first.
That's quite a lot of pieces you already have in place to interface with the TOSA reference model! We're pleased to see you've played with the reference model (which uses Eigen) too.
There are some mechanical questions here around how a library form of the reference model would interface into the MLIR "runtime". Do you already have a proposal for how that would be driven, @marbre? Right now, the reference model has a graph traverser along with the Eigen-based functional implementation of the ops themselves.
So far, I only have some initial thoughts but not yet a concrete proposal.
The main issue is indeed the tight coupling between the serializer, the graph traverser and the ops. From what we have so far with TOSA → EmitC → EmitC C++ reference implementation (by EmitC C++ reference implementation I refer to the emitc_*.h files in https://github.com/iml130/mlir-emitc/blob/main/include/emitc/), it would be straightforward for us to replace the EmitC C++ reference implementation with the Eigen-based ops implemented in the TOSA reference model.
However, this would require decoupling the op implementations and making them available as a library.
Ops in the TOSA reference model are stateful. It is probably necessary to separate state handling from the computation itself. My colleague @david_ronnenberg has looked into it and could give a more detailed description of what would be needed.
TOSA supports some datatypes for which we don't have any support on the EmitC side so far.
The emitted tensors target emitc::Tensor. We would need an efficient conversion to TOSA/Eigen tensors, or an option to directly emit those via the C++ emitter.
So there are definitely mechanical questions that need to be solved. We're willing to push this forward but, as mentioned before, would like to hear whether this is of interest to the community.
@david_ronnenberg Please feel free to comment, especially if I have missed something that we have already discussed internally.
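On the statefulness point above, one possible shape for the separation is to keep the computation free of graph bookkeeping and pass state in explicitly, so that the caller (e.g. a traverser) owns persistence. This is a sketch only; the names are illustrative and not the reference model's actual classes.

```cpp
// Sketch of separating op state from the computation itself, as
// discussed above. Illustrative names, not reference-model code.
#include <cassert>
#include <cstddef>
#include <vector>

// State kept across invocations (e.g. a variable tensor for an RNN cell).
struct VariableState {
  std::vector<float> value;
};

// The computation only reads its inputs and the explicitly passed state;
// the caller decides where the state lives and when it is persisted.
inline std::vector<float> stepAccumulate(const std::vector<float> &input,
                                         VariableState &state) {
  if (state.value.empty())
    state.value.assign(input.size(), 0.0f);
  for (std::size_t i = 0; i < input.size(); ++i)
    state.value[i] += input[i];
  return state.value;
}
```

With this split, a library client can hold VariableState objects per model instance while the op body itself stays a plain function.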
A quick heads-up related to this - we have implemented the Statefulness Support for TOSA proposal (#9 by sjarus on the TOSA Discourse) as a prototype for the purpose of expressing RNNs; it comprises the TOSA dialect utility ops, the serialization lib and reference model updates, and it works. We'll be releasing it in the near future.
This will add some complexity around a few things, e.g. a simple memory model for maintaining persistent state contents, and interfacing those utility ops while still in MLIR, whereas the serialized form translates these into tensors with a special is_variable bit set.
I think that makes sense out of tree. I'd rather not have a dependency on Eigen in core (too many moving parts, and I don't think C++ standard support lines up). In the MHLO repo that could be fine given TF's dependency on Eigen. Utilizing a pure BLAS (or some similarly standard) interface and being able to switch in different implementations would be more appealing.
Having two versions also makes sense, so you have one for correctness and the other as a baseline for comparing against codegen, or for where you only have a C/C++ compiler available as backend. But Eigen is a large dependency, and I'd much rather have a pure reference implementation + codegen in core than an optimized library implementation with more dependencies, which would introduce a support & maintenance requirement.
Thanks a lot for your feedback. A pure reference implementation is what we initially had in mind, and our header-only implementation therefore has no dependency other than the standard library. Hence, I can of course think of upstreaming the reference implementation we have so far to the MLIR core. To us, the main question really is whether there is interest in adding such a reference implementation to the core (or to some other repo).
I therefore agree that an Eigen-based implementation makes more sense as part of an out-of-tree project. Basically, I am trying to figure out which implementation might be interesting for which user base.
We'll definitely take a look into the proposal. When you say you'll be releasing it in the near future, is a date already scheduled? Referring to @jpienaar's comment, what location would you like to see/suggest for a reference implementation that lowers TOSA to EmitC to something tbd?
When we implemented the TOSA reference model and the MLIR pass to serialize TOSA form to drive the reference model, we faced similar issues with both the Eigen and Flatbuffers dependencies, which we could not see how to fit within the MLIR core. Ultimately we left them as originally done - the reference model as a standalone flatbuffers/JSON-consuming binary, and the MLIR pass in a standalone repo that could easily be linked with an MLIR pass manager.
To get a better understanding of the requirements, what does the EmitC path intend? What would the networks being emitted be - single-op unit test cases, for example, or full networks in TOSA MLIR?
To run full networks, any parallel reference implementation would need to implement graph building and traversal, the basic memory model concepts, file I/O… which all amount to duplicating the existing reference model functionality.
Conceptually it seems more straightforward to have EmitC have a functional verification mode where it emits calls to construct a full MLIR TOSA form, invoke the graph builder in the existing reference model, invoke its traverser and get an actual functional bit level output.
This would be an out-of-tree path since the dependency on the external reference model and its own Eigen dependency would remain; the flatbuffers dependency would be absent since this is an independent path to construct and drive the reference model.
Having discussed this internally with @jsmolens and @eric-k we think this is feasible, though the interfaces would need to be defined.
No, we're still trying to close out internal and external feedback loops on this, as is normal with new proposals on the TOSA discourse.
I'm surprised: I'm pretty sure I mentioned a path for this back then (basically a CMake flag to conditionally enable this); you shouldn't hesitate to bring up this kind of question as soon as it occurs.
Ultimately, I believe it'll be incredibly valuable to build as much as possible of the end-to-end flow upstream.
I've been meaning to invest more into this for a while, and we'll get there. The duplication may be unfortunate, but what's the alternative? Would you be willing to upstream your work instead?
Thanks for reminding me of this - I should have mentioned we had this option. At that point in time we considered it simpler to first get the pieces out, get traction and make the case for upstreaming.
Right now everything needed for TOSA functional validation is open source: the dialect, frontend legalizations (TF done, Torch in development), the reference model, and the MLIR pass to emit flatbuffers to the reference model.
Probably the easiest starting point here is to move the last-mentioned piece into core as a conditional build rule, since it has a flatbuffers dependency. Could you point us to an existing core MLIR conditional build CMake construct?
Weāre very interested in upstreaming more pieces of TOSA infrastructure due to the interest level around it.
For example, we'd like to upstream the legalization unit test infrastructure to the frameworks. This would catch breakages in legalizations - valuable for e2e flows depending on TOSA.
We already have positive interest from @_sean_silva to go ahead with doing this for Torch-MLIR. We'd like to make the case, based on that proof of concept, to upstream the unit test infrastructure for the TF/TFLite->TOSA legalizations to that repo too - something that would be run as at least an optional CI test. But perhaps this isn't the right venue for that, and if so, is there another place we could propose it?
The MLIR_ENABLE_BINDINGS_PYTHON option: this is maybe the closest one since it also relies on externally available optional dependencies; see the online doc (we could have a similar page for TOSA stuff).
The MLIR_ENABLE_CUDA_RUNNER option: requires CUDA available for building and a GPU for running the tests (same with the options MLIR_ENABLE_ROCM_RUNNER, MLIR_ENABLE_SPIRV_CPU_RUNNER and MLIR_ENABLE_VULKAN_RUNNER).
Happy to help the integration as you need!
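Following the pattern of the options above, a conditional build rule for the flatbuffers-dependent piece might look roughly like this. This is a sketch only; the option name MLIR_ENABLE_TOSA_REF_MODEL and the subdirectory are hypothetical, not existing MLIR CMake constructs.

```cmake
# Hypothetical sketch in the spirit of MLIR_ENABLE_CUDA_RUNNER: guard the
# flatbuffers-dependent TOSA serialization pass behind an opt-in flag.
option(MLIR_ENABLE_TOSA_REF_MODEL
       "Build the TOSA reference-model serialization pass" OFF)

if(MLIR_ENABLE_TOSA_REF_MODEL)
  # The external dependency is only required when the option is enabled.
  find_package(FlatBuffers REQUIRED)
  add_subdirectory(TosaSerialization)  # placeholder directory name
endif()
```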
I'd be really interested to get to the point where we have all the pieces in-tree, including a test harness to actually run a model, with some minimalist runtime support as needed.
I would still happily work on upstreaming the conversion from TOSA to EmitC as well as the reference implementation. There are a few things to consider:
The dimensions of tensors are encoded as template arguments, which is not the most convenient approach when adding new operations to the reference implementation. Alternatives could be:
Explicitly pass tensor dimensions via arguments to the constructor.
Model the tensor dimensions similar to mdspan.
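The first alternative above could be sketched as follows: a row-major tensor that takes its shape at runtime through the constructor, backed by std::vector like the EmitC reference implementation. The class name and interface are illustrative only.

```cpp
// Sketch of the "dimensions via constructor" alternative: shape is a
// runtime value rather than template arguments. Illustrative only.
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

template <typename T>
class RuntimeTensor {
public:
  // Shape passed at runtime; storage is a flat row-major buffer.
  explicit RuntimeTensor(std::vector<std::size_t> shape)
      : shape_(std::move(shape)),
        data_(std::accumulate(shape_.begin(), shape_.end(), std::size_t{1},
                              std::multiplies<std::size_t>())) {}

  // Row-major linearized element access.
  T &at(const std::vector<std::size_t> &idx) {
    std::size_t flat = 0;
    for (std::size_t d = 0; d < shape_.size(); ++d)
      flat = flat * shape_[d] + idx[d];
    return data_[flat];
  }

  std::size_t rank() const { return shape_.size(); }
  std::size_t size() const { return data_.size(); }

private:
  std::vector<std::size_t> shape_;  // declared before data_: init order matters
  std::vector<T> data_;
};
```

The trade-off versus template-encoded dimensions is losing compile-time shape checking and some optimization opportunity, in exchange for ops that no longer need a template instantiation per shape.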
At the moment, we test every TOSA to EmitC conversion with lit tests and have a unit test for every operation in the reference implementation. However, we do not test the interaction between the generated code and the reference implementation on a per-op basis. Within our repo we have an integration test which translates a MobileNetV2 from Keras and runs the compiled executable, but this does not cover every (supported) op. For upstream inclusion, we probably want a test suite that converts a TOSA op to EmitC, generates C++ via the emitter, and compiles and executes this as a test case (?).
One thing to note: the reference implementation is based on std::vector. Thus, one cannot access the data pointer for boolean tensors, since std::vector<bool> is a bit-packed specialization that does not provide data().
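The limitation and one common workaround can be shown in a few lines: back the boolean tensor with bytes instead of std::vector<bool>, so a stable raw pointer exists. BoolTensor below is a hypothetical illustration, not part of the existing reference implementation.

```cpp
// std::vector<bool> is a bit-packed specialization with no data() member
// returning bool*, so it cannot hand out a raw buffer pointer. A common
// workaround is to store booleans as bytes. Sketch only.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BoolTensor {
  std::vector<std::uint8_t> storage;  // one byte per element, zero-initialized

  explicit BoolTensor(std::size_t n) : storage(n, 0) {}

  // A real, stable data pointer - unlike std::vector<bool>.
  std::uint8_t *data() { return storage.data(); }

  bool get(std::size_t i) const { return storage[i] != 0; }
  void set(std::size_t i, bool v) { storage[i] = v ? 1 : 0; }
};
```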
We've been using a std::vector<size_t>, would that work? (fwiw, in the tf_compile use case, which EmitC would replace, we don't even use that; we just bind by name and let testing catch invalid memory accesses.)
Is the concern that we'd have an extra test burden on anything that's not primarily interested in consuming EmitC? If so, we can use the ml-opt-* build bots to exercise all this - it'd both do integration testing like it does today - taking "some" model → tf_compile → use it - and it could do the more rigorous suite you're mentioning.
Probably ok, so far we've modeled bools as int64_t anyway. Not saying it's great, but this doesn't sound like it'd break anything.
While here, on tensor assumptions, we basically have 2 main ones:
we can bind by name
we can use stable buffers that are in row-major order. By "stable" I mean that once the EmitC object representing a model instance is handed a buffer, it doesn't free it until the object is deleted. We're OK with handing those buffers over ourselves, fwiw (the benefit of letting the model hand them out is that it may have preferences about where to place them)
This lets us pay a higher upfront cost at binding time, and then we just do simple stores into a buffer (no virtualization or anything) + eval (we can chat about this more)
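The two assumptions above could be sketched as a minimal interface: bind caller-owned, row-major buffers by name once, then repeatedly store into them and call eval. ModelInstance and its members are hypothetical names for illustration, not an existing EmitC interface.

```cpp
// Sketch of bind-by-name plus stable buffers: the instance records the
// caller-owned pointers and keeps using them until it is destroyed,
// never freeing them itself. Illustrative only.
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct ModelInstance {
  // Caller-owned, row-major buffers bound by name.
  std::map<std::string, float *> bindings;

  void bind(const std::string &name, float *buffer) {
    bindings[name] = buffer;
  }

  // Stand-in eval: out = in * 2, reading/writing through bound buffers.
  void eval(std::size_t n) {
    float *in = bindings.at("input");
    float *out = bindings.at("output");
    for (std::size_t i = 0; i < n; ++i)
      out[i] = in[i] * 2.0f;
  }
};
```

Because the pointers are stable, each inference is just plain stores into the input buffer followed by eval, with the name lookup cost paid once at binding time.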
Yes, the use of std::vector<size_t> is one possible option to specify the shape in the constructor. I think we will evaluate the different options.
Well, I am not confident that we want to introduce a dependency on TensorFlow into the test pipeline. But we probably also don't want to include a model stored in MLIR. So what I have in mind is rather a pipeline that converts every single supported TOSA op to EmitC. The EmitC op is then translated to C++, compiled and executed.
The current implementation is row-major. However, we don't support bind-by-name (AFAIR this was in one of the MHLO-related PRs), and the C++ reference implementation doesn't provide statefulness support for TOSA ops.