Several months ago we raised a discussion about an end-to-end compilation flow providing fully dynamic shape support, which aims to address the limitations of current XLA on dynamic shape workloads. Here are the RFC and the related discussion from that time.
Here are some recent updates: the initial version of this dynamic shape compiler was released internally at Alibaba about two months ago, and has started to serve a few internal/external inference workloads on the GPU backend. Although there are still a lot of TODOs, the performance basically meets our expectations: it is close to that of XLA while needing only one compilation result across different shapes, and it sometimes even exceeds XLA in cases where XLA struggles, for example a While loop whose shapes differ between iterations.
As an example, here is the performance result of an internal ASR inference model (a model on which XLA already provides dramatic gains):
The pipeline is currently made up of about 30-40 passes end to end, part of which (roughly <50%) is reused or inherited from the TF/MLIR community. The backbone passes are shown in this figure:
We also built a "runtime abstraction layer" that provides interfaces for different runtime scenarios: both execution within the TF runtime (like XLA) and as a standalone application.
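To give a rough idea of what such a layer could look like, here is a minimal C++ sketch. All names here (`RalContext`, `TensorView`, `RunCompiledExecutable`) are hypothetical illustrations for this post, not the actual interfaces:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Backend-neutral view of a tensor whose shape is only known at runtime.
struct TensorView {
  void* data;
  std::vector<int64_t> shape;
};

// Each runtime scenario (TF runtime, standalone application) implements
// this interface, so the compiled executable is isolated from the
// details of the hosting environment.
class RalContext {
 public:
  virtual ~RalContext() = default;
  virtual void* Allocate(size_t bytes) = 0;   // e.g. device memory
  virtual void Deallocate(void* ptr) = 0;
  virtual void LaunchKernel(/* kernel handle, launch dims, args */) = 0;
};

// The compiled artifact only talks to RalContext; swapping the concrete
// context switches between running inside TF and running standalone.
void RunCompiledExecutable(RalContext* ctx,
                           const std::vector<TensorView>& inputs,
                           std::vector<TensorView>* outputs);
```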
Attached are the full logs of the ASR model after each pass. They might help a little bit with a rough understanding of these passes, although they are cluttered with a lot of information. (Please cat the parts together into one tar.gz.)
asr.tar.gz.part_aa.txt (4 MB)
asr.tar.gz.part_ab.txt (4 MB)
asr.tar.gz.part_ac.txt (4 MB)
asr.tar.gz.part_ad.txt (2.9 MB)
We also found some interesting aspects regarding performance:
(1) We observed that "shape constraints" are very important for performance under dynamic shape semantics, i.e., knowing that dim_size_a is always equal to dim_size_b. Sometimes CSE can infer the same information and sometimes it cannot, so it is very important to represent this information explicitly in the IR; it benefits both the graph optimization passes (e.g. fusion decisions) and the lower-level code generation. See more discussion in the thread, and the first sketch after this list.
(2) Some basic optimizations that are straightforward in a static shape compiler require more work in a dynamic shape compiler, for example the loop unrolling (vectorization) in XLA and the resolution of implicit broadcasts. One solution is to generate multiple versions of a kernel and select the proper one at runtime according to the runtime shapes (see the second sketch after this list).
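To make (1) concrete, here is a minimal C++ sketch of one possible way to track such symbolic equality constraints: a union-find over symbolic dimension ids. The class and method names are made up for illustration and are not our actual implementation:

```cpp
#include <vector>

// Each dynamic dimension in the IR gets a symbolic id; dims known to be
// equal are merged, so later passes can query equality even when CSE
// cannot prove it from the IR structure alone.
class SymbolicDimEquality {
 public:
  // Introduce a fresh symbolic dimension.
  int NewDim() {
    int id = static_cast<int>(parent_.size());
    parent_.push_back(id);
    return id;
  }

  // Record a constraint such as "dim_size_a is always equal to dim_size_b".
  void MarkEqual(int a, int b) { parent_[Find(a)] = Find(b); }

  // Fusion decisions / codegen can ask: are these dims provably equal?
  bool IsEqual(int a, int b) { return Find(a) == Find(b); }

 private:
  // Find the representative id, with path halving for near-constant cost.
  int Find(int x) {
    while (parent_[x] != x) x = parent_[x] = parent_[parent_[x]];
    return x;
  }
  std::vector<int> parent_;
};
```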
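And to illustrate the multi-versioning idea in (2): a minimal C++ sketch in which the specialized variant assumes the dynamic length is a multiple of the unroll factor, with a scalar version kept as the always-correct fallback. The function names and the unroll factor of 4 are hypothetical:

```cpp
#include <cstdint>

// Scalar fallback: correct for any length n.
void FusedKernelScalar(const float* in, float* out, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

// Specialized variant: valid only when n % 4 == 0, so the body can be
// unrolled/vectorized without a remainder loop.
void FusedKernelVec4(const float* in, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += 4) {
    out[i]     = in[i]     * 2.0f;
    out[i + 1] = in[i + 1] * 2.0f;
    out[i + 2] = in[i + 2] * 2.0f;
    out[i + 3] = in[i + 3] * 2.0f;
  }
}

// Runtime dispatch: inspect the actual shape and pick the variant whose
// specialization assumption holds.
void FusedKernel(const float* in, float* out, int64_t n) {
  if (n % 4 == 0) {
    FusedKernelVec4(in, out, n);
  } else {
    FusedKernelScalar(in, out, n);
  }
}
```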
The op coverage is now roughly acceptable for inference workloads, while training workloads still need more work on coverage of backward ops. Work on the CPU backend is also ongoing.
We are interested in, and are now considering, pushing this code to the community. One headache is that there is certainly a lot of work to do for the code merge, both for us and for the reviewers… We are still investigating how to do this more efficiently, and I am also opening this thread to collect suggestions. Please let me know what you think.