[RFC] Proposal for TLX: Tensor LLVM eXtensions


Authors: Akash Kothari (UIUC), Abdul Rafae Noor (UIUC), Dounia Khaldi (Intel), Vikram Adve (UIUC), Yuanke Luo (Intel), Sudipta Sengupta (Amazon AWS), Milind Girkar (Intel), Charith Mendis (UIUC)

Rationale

Diverse hardware vendors are developing new hardware support for (mostly dense) tensor computations, which have become increasingly important for machine learning applications. These include both ISA extensions on CPUs and GPUs (such as Intel AMX, Power MMA, NVIDIA’s tensor cores, AMD’s matrix cores, and Qualcomm’s HVX vector ISA) and dedicated accelerators for compute offload (such as NVIDIA’s NVDLA, Amazon’s Inferentia and Trainium, and numerous ML accelerators from smaller companies). While ML workloads are the primary motivation and likely to be the dominant use cases, other tensor-intensive application domains, such as image processing, scientific computing, quantum simulations, financial modeling, and others can benefit from this hardware support as well, via languages like C++, DPC++, Julia, Fortran, Halide, CUDA, OpenCL, and others.

LLVM can play a crucial role in making it easier for these vendors to create optimizing compiler back-ends for their emerging hardware (if the existing vector and matrix support in LLVM were generalized to support tensor operations). LLVM is already widely-used today by many of the vendors that develop these tensor architectures, e.g., to target CPUs and GPUs. LLVM is highly retargetable, by design. For the CPU targets, LLVM allows an integrated code generation framework for tensor operations with optimized intermixing of scalar, 1-D vector and 2-D matrix operations in the same code section (e.g., loop body). And LLVM has front-ends for a wide range of high-level languages, including essentially all the languages used widely for relevant application domains today.

No existing infrastructure we know of meets these needs. MLIR is likely the best option, and we believe it is entirely complementary to LLVM. MLIR provides strong support for high-level tensor operations in TOSA, relevant optimizations in Affine and Linalg, and lowering paths to accelerators, GPUs and (via the LLVM dialect) CPUs. Crucially, however, MLIR does not have a low-level code generation framework that is retargetable to diverse hardware: it relies on LLVM for this purpose. If LLVM could be extended with tensor operations and a corresponding retargetable tensor code generation framework, MLIR could leverage this as well. Moreover, there are enough vendors and also languages that rely heavily on LLVM (but don’t use MLIR) that it seems worthwhile to have a high-quality tensor code generation framework in both LLVM as well as in MLIR. Ideally, both systems would largely share the same code.

The broad goal of our project is to add a retargetable tensor code generation framework to LLVM. We are currently working on a prototype implementation with our collaborators at Amazon AWS, Intel, IBM and Qualcomm. This RFC focuses on the first stage: extending the LLVM IR with tensor operations which we refer to as TLX (Tensor LLVM eXtensions).

Objectives

  • A unified retargetable code generation and optimization framework for LLVM to target diverse tensor architectures with a common set of IR extensions, instead of using target-specific solutions.
  • (Subject of this RFC.) A single set of target-agnostic tensor extensions in LLVM IR that higher-level tensor code generation frameworks such as XLA, Halide, TVM, MLIR, etc. can target, instead of lowering to target-specific intrinsics in LLVM, while retaining the optimizations in these high-level frameworks.
  • A pathway for LLVM-based languages such as C/C++, DPC++, Fortran, Rust, Julia, etc. that do not have front ends for compiler systems like MLIR, TVM, XLA, etc. to target modern tensor architectures by lowering to our tensor extensions in LLVM.
  • Target-independent optimizations (e.g. peephole optimizations and generic SSA-based optimizations) and also flexible code generation capabilities in LLVM that could involve mixing instructions operating on vector and rectangular registers, and involve developing cost models which could help reduce register spills and maximize usage of available hardware resources.
  • Contribute our tensor extensions (this RFC) and retargetable code generation framework (as a followup) to the LLVM project for the community to experiment with and provide feedback.
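As a rough illustration of the second objective, a front end could emit a single generic tensor intrinsic that each back end pattern-matches to its own hardware (AMX tiles, MMA accumulators, tensor cores), instead of emitting target-specific intrinsics itself. The intrinsic name and signature below are purely illustrative, not the actual TLX spelling:

```llvm
; Hypothetical target-agnostic tensor IR (names/signatures are illustrative).
; One generic matmul on flattened 4x4 tiles; the target back end, not the
; front end, decides how to map it onto AMX, MMA, or tensor-core instructions.
define <16 x float> @matmul_4x4(<16 x float> %a, <16 x float> %b,
                                <16 x float> %acc) {
  %c = call <16 x float> @llvm.tensor.matmul.v16f32(
           <16 x float> %a, <16 x float> %b, <16 x float> %acc)
  ret <16 x float> %c
}
declare <16 x float> @llvm.tensor.matmul.v16f32(<16 x float>, <16 x float>, <16 x float>)
```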

Full Proposal

Google doc with the full proposal is here.

The original mailing-list replies appear to be lost; please refer to the thread [llvm-dev] [RFC] Proposal for TLX: Tensor LLVM eXtensions.

I agree that it is preferable to keep LLVM IR low level and simple, and that a primitive matrix type would be too high level. So I would rather represent matrices by lowering the matrix type to a vector type. But this runs into a problem with scalable matrix and vector types. Currently there is only one scaling variable (named vscale) in the scalable vector format, for example <vscale x 4 x i32>, so we cannot represent more than one kind of scalable quantity concurrently in LLVM IR. If there were more than one such variable, we could distinguish whether a given scalable vector was converted from a scalable matrix type.
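The ambiguity can be seen in a small sketch: once a 2-D scalable matrix is flattened, its type is identical to that of an ordinary scalable vector, because both share the single vscale term.

```llvm
; A flattened scalable matrix is indistinguishable from a scalable vector:
; both values below have exactly the same type, scaled by the same vscale.
%vec = load <vscale x 4 x i32>, ptr %p   ; a genuine scalable vector
%mat = load <vscale x 4 x i32>, ptr %q   ; a flattened scalable matrix -- same type!
```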

I would therefore like to propose allowing multiple scalable vector variables to exist concurrently in LLVM IR, such as <vscale1 x 4 x i32> and <vscale2 x 4 x i32>. These would all still be scalable vector types, but with different variables that may have different ranges. For example, vscale1 might range over the integers in [1, 2048), while vscale2, representing a scalable matrix, might range over the perfect squares {4, 9, 16, 25, ...}. We could add a variable number or name to the scalable vector type, which would also affect operations and predicates on scalable vectors, such as equality comparison. Overall, I suspect this is not that difficult. However, I have not yet figured out a good way to handle the different scalable vector types in the backend. The backend represents scalable vector types with MVTs such as nxv4i32 and nxv8i16, which enumerate all existing type objects and are also used in TableGen. I am afraid there is no extra dimension available to represent different values of n that would also work smoothly in TableGen, short of making each n an individual type object, for example n2xv4i32.
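The proposal above can be sketched in IR form. The vscale1/vscale2 spelling and the ranges in the comments are illustrative only; this syntax does not exist in LLVM today:

```llvm
; Today: a single runtime scaling term, vscale, shared by every scalable type.
%v = add <vscale x 4 x i32> %a, %b        ; element count = vscale * 4

; Proposed (illustrative syntax): independently ranging scale variables, so a
; scalable matrix lowered to a vector stays distinguishable from an ordinary
; scalable vector.
%w = add <vscale1 x 4 x i32> %c, %d       ; vscale1 in [1, 2048)
%m = add <vscale2 x 4 x i32> %e, %f       ; vscale2 in {4, 9, 16, 25, ...}
```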

As scalable vectors are being adopted by targets such as Arm (SVE) and RISC-V (RVV), scalable matrices and other scalable designs in specific hardware are becoming more common. So we need to reach consensus in order to enable this quickly.

@clattner @fhahn

Zeson