Development of high-level Tensor Compute Primitives dialect(s) and transformations

MLIR was designed to enable unifying ML compilation frameworks, allowing teams to focus on their core value while improvements in one framework benefit others. One of the missing features (footnote: we have placed “ML dialect” as a teaser/lure at our open design meetings a few times in the past ;-)) is a set of common ops that multiple different ML frameworks could target, as a layer designed from first principles and suitable as an optimization IR.

As mentioned at the end of the TensorFlow MLIR SIG presentation about nGraph, we believe it is time to start a collaborative effort to provide MLIR with capabilities that would serve multiple ML frameworks: in particular, a dialect that could be targeted by multiple ML frontends and that is suitable for further optimization and lowering/codegen. Basically this is about all the layers above Chris’ thoughts on the codegen stack. We think that building such a unified solution within MLIR, one that is directly informed by and internalizes the lessons of predecessors in the optimization and code generation spaces, will be valuable and successful.

In practice, we propose to bring together the interested people to implement this in MLIR and iterate on a prototype in-tree. Similar working groups exist in LLVM, for example to sync on the RISC-V effort, or the Loop Optimization WG, and we propose to organize ourselves similarly:

  • The main goal is to coordinate the interested parties on the actual implementation in MLIR.
  • Discourse (here) is the main place for coordination (possibly in a dedicated section, otherwise in the general MLIR category).
  • We plan to host a bi-weekly open conference call, with an agenda defined ahead of time.
  • There is no formal membership process: participating in the discussions and submitting patches is open to anyone, and is, effectively, the membership process.

One of the first objectives for us would be to define the guiding principles for the dialect(s) and weigh trade-offs (e.g. what does it mean to be a transformation-first IR? Should broadcasting be implicit or explicit with an op? How orthogonal should the ops be? How pure should they be / what about side effects? What about control flow? Should it be closed under differentiation? How are the semantics of the ops expressed?). We also expect this work to feed proposals for improving/extending some MLIR concepts and core dialects when necessary to support this new effort.

We very much look forward to many active discussions!


Thanks Mehdi for starting this very important discussion.

Do we have a list of prior art in this space? I think XLA’s HLO representation and nGraph are kind of what we are talking about here. Are there others?

Also, I’d like to throw out there the important requirement for dynamic shapes. I’m not sure how orthogonal or coupled shape-related infrastructure (e.g. shape function definitions) is to the dialect we are discussing here, but I feel like we should at least be able to articulate if it is or isn’t. It would be especially interesting if we could find prior art for dynamic shape support with an IR at the level of abstraction we are targeting in this thread.

Yes, HLO and nGraph are the ones I’m the most familiar with; hopefully other interested folks here know other frameworks better.
I agree that dynamic shape is an important part at this layer, and the work on shape modeling isn’t totally orthogonal. I believe that @jpienaar has something coming on this.

Thanks, Mehdi, for starting this conversation. On our project, we’ve struggled with the lack of high-level primitives in this area. Because there wasn’t a conversation going on about it, common infra/ops ended up being developed outside of MLIR, with no place to anchor the pieces that would make sense to move core-ward. Also, having lived through a number of rounds of debates on opinionated op-sets, I appreciate the tone you have set with respect to creating a space to work this out. While my most recent background is with XLA/HLO, I’ve worked with a number of others at different layers and would like the chance to distill out some common ideas and infra. There is a lot of prior art on this, but in my experience, we have to re-learn the best way to express it in MLIR, which gives us a lot more flexibility than older representations (with all of the positives and negatives that come from flexibility).

Jumping a bit into the technical discussion, I think that one of the issues with HLO specifically is that it combines opinions on a few different axes that don’t necessarily need to be combined in new work.

  1. Static shapes (versus a full story for dynamic)
  2. Implicit vs explicit broadcast semantics
  3. Functional control flow (vs CFG)
  4. Preference for “primitive” vs high-level math ops
  5. Preference for explicit reductions vs high-level aggregate ops

When bundled into a single dialect and codegen path, all of these opinions get taken at once, and there are reasonable arguments for alternatives to each (and others). Based on the discussions on the mailing list and our experience of late, it is actually #1 (and by extension #2, since you have to handle that) that would benefit the most right now from some common infra, and I’m not sure that needs to be tied to the op story (it can support different sets of high-level ops).

In our work on #3-5, we end up taking different opinions about these at different parts of the pipeline, and I think that MLIR lets us do this: we don’t necessarily need to converge on one “best” set of “dnn math ops”. Since there is a small set of canonical ways to express them, we might opt to define them at multiple levels (i.e. have high-level “nn” ops like softmax and relu alongside what they are implemented in terms of). My opinion for this working group would be to get the lowest common level specified in MLIR itself, possibly leaving the higher levels to frontend-specific frameworks. In practice, this has worked reasonably well for TensorFlow/XLA.

For dynamic shapes specifically, there is a lot bound up in that with respect to a “source-level” representation. Discourse isn’t letting me post links to GitHub, where we are working on a sample shape dialect, but here is the general direction of types/ops we think would be helpful. I suspect that we need to define some new shape-related types and corresponding ops, and then that gets into broadcasting quickly, which perhaps should be considered at the same level:

%shp0 = shape.get_ranked_shape %arg0 : tensor<?x5xf32> -> !shape.ranked_shape<?x5xindex>
%shp1 = shape.get_ranked_shape %arg1 : tensor<5xf32> -> !shape.ranked_shape<5xindex>
%dim0 = "compute_broadcasted_shape"(%shp0, %shp1) : (!shape.ranked_shape<?x5xindex>, !shape.ranked_shape<5xindex>) -> (index)  // Can be codegened directly
%1 = shape.ranked_broadcast_in_dim %arg1, %dim0 { broadcast_dimensions = dense<1> : tensor<1xi64> } : tensor<5xf32> -> tensor<?x5xf32>

To wrap up, I think it would be great to have the work group break the problem up a bit and focus semi-independently on:

  1. Dynamic shape related infra (probably with sub-points for high-level representations and things like shape inference)
  2. Structural primitives (reductions, control flow, etc)
  3. Small set of primitive math ops

It doesn’t take very complicated examples to meaningfully need resolution on each of these.

  • Stella

Examples:

Add op broadcasting types

// op-carried
%24 = "xla_hlo.add"(%23, %4) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<10xf32>) -> tensor<?x10xf32>

// explicit
%8 = "xla_hlo.broadcast_in_dim"(%2) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<16xf32>) -> tensor<?x16xf32>
%9 = xla_hlo.add %7, %8 : tensor<?x16xf32>

tanh vs sigmoid

// TANH
%10 = "xla_hlo.tanh"(%9) : (tensor<?x16xf32>) -> tensor<?x16xf32>

// SIGMOID
%0 = xla_hlo.constant dense<5.000000e-01> : tensor<f32>
%13 = "xla_hlo.broadcast"(%0) {broadcast_sizes = dense<[-1, 16]> : tensor<2xi64>} : (tensor<f32>) -> tensor<?x16xf32>
%14 = xla_hlo.mul %12, %13 : tensor<?x16xf32>
%15 = "xla_hlo.tanh"(%14) : (tensor<?x16xf32>) -> tensor<?x16xf32>
%16 = xla_hlo.mul %15, %13 : tensor<?x16xf32>
%17 = xla_hlo.add %16, %13 : tensor<?x16xf32>

XLA softmax

    %36 = "xla_hlo.reduce"(%35, %1) ( {
    ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):	// no predecessors
      %45 = xla_hlo.max %arg1, %arg2 : tensor<f32>
      "xla_hlo.return"(%45) : (tensor<f32>) -> ()
    }) {dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<f32>) -> tensor<?xf32>
    %37 = "xla_hlo.broadcast_in_dim"(%35) {broadcast_dimensions = dense<[0, 1]> : tensor<2xi64>} : (tensor<?x10xf32>) -> tensor<?x10xf32>
    %38 = "xla_hlo.broadcast_in_dim"(%36) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<?xf32>) -> tensor<?x10xf32>
    %39 = xla_hlo.sub %37, %38 : tensor<?x10xf32>
    %40 = "xla_hlo.exp"(%39) : (tensor<?x10xf32>) -> tensor<?x10xf32>
    %41 = "xla_hlo.reduce"(%40, %2) ( {
    ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):	// no predecessors
      %45 = xla_hlo.add %arg1, %arg2 : tensor<f32>
      "xla_hlo.return"(%45) : (tensor<f32>) -> ()
    }) {dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<f32>) -> tensor<?xf32>

“simplified” mlp (tanh, no softmax)

  func @predict_tanh_no_softmax(%arg0: tensor<?x16xf32>) -> tensor<?x10xf32> attributes {iree.module.export, iree.reflection = {abi = "sip", abiv = 1 : i32, sip = "I8!S5!k0_0R3!_0"}, tf._input_shapes = ["tfshape$dim { size: -1 } dim { size: 16 }", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true"], tf.signature.is_stateful} {
    %0 = flow.variable.load @h2_bias : tensor<16xf32>
    %1 = flow.variable.load @out_bias : tensor<10xf32>
    %2 = flow.variable.load @h1_bias : tensor<16xf32>
    %3 = flow.variable.load @h2_weights : tensor<16x16xf32>
    %4 = flow.variable.load @out_weights : tensor<16x10xf32>
    %5 = flow.variable.load @h1_weights : tensor<16x16xf32>
    %6 = "xla_hlo.dot"(%arg0, %5) : (tensor<?x16xf32>, tensor<16x16xf32>) -> tensor<?x16xf32>
    %7 = "xla_hlo.add"(%6, %2) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x16xf32>, tensor<16xf32>) -> tensor<?x16xf32>
    %8 = "xla_hlo.tanh"(%7) : (tensor<?x16xf32>) -> tensor<?x16xf32>
    %9 = "xla_hlo.dot"(%8, %3) : (tensor<?x16xf32>, tensor<16x16xf32>) -> tensor<?x16xf32>
    %10 = "xla_hlo.add"(%9, %0) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x16xf32>, tensor<16xf32>) -> tensor<?x16xf32>
    %11 = "xla_hlo.tanh"(%10) : (tensor<?x16xf32>) -> tensor<?x16xf32>
    %12 = "xla_hlo.dot"(%11, %4) : (tensor<?x16xf32>, tensor<16x10xf32>) -> tensor<?x10xf32>
    %13 = "xla_hlo.add"(%12, %1) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<10xf32>) -> tensor<?x10xf32>
    %14 = "xla_hlo.tanh"(%13) : (tensor<?x10xf32>) -> tensor<?x10xf32>
    return %14 : tensor<?x10xf32>
  }

full mlp

  func @predict(%arg0: tensor<?x16xf32>) -> tensor<?x10xf32> attributes {iree.module.export, iree.reflection = {abi = "sip", abiv = 1 : i32, sip = "I8!S5!k0_0R3!_0"}, tf._input_shapes = ["tfshape$dim { size: -1 } dim { size: 16 }", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true", "tfshape$unknown_rank: true"], tf.signature.is_stateful} {
    %0 = xla_hlo.constant dense<5.000000e-01> : tensor<f32>
    %1 = xla_hlo.constant dense<0xFF800000> : tensor<f32>
    %2 = xla_hlo.constant dense<0.000000e+00> : tensor<f32>
    %3 = flow.variable.load @h2_bias : tensor<16xf32>
    %4 = flow.variable.load @out_bias : tensor<10xf32>
    %5 = flow.variable.load @h1_bias : tensor<16xf32>
    %6 = flow.variable.load @h2_weights : tensor<16x16xf32>
    %7 = flow.variable.load @out_weights : tensor<16x10xf32>
    %8 = flow.variable.load @h1_weights : tensor<16x16xf32>
    %9 = "xla_hlo.dot"(%arg0, %8) : (tensor<?x16xf32>, tensor<16x16xf32>) -> tensor<?x16xf32>
    %10 = "xla_hlo.add"(%9, %5) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x16xf32>, tensor<16xf32>) -> tensor<?x16xf32>
    %11 = "xla_hlo.broadcast"(%0) {broadcast_sizes = dense<[-1, 16]> : tensor<2xi64>} : (tensor<f32>) -> tensor<?x16xf32>
    %12 = xla_hlo.mul %10, %11 : tensor<?x16xf32>
    %13 = "xla_hlo.tanh"(%12) : (tensor<?x16xf32>) -> tensor<?x16xf32>
    %14 = xla_hlo.mul %13, %11 : tensor<?x16xf32>
    %15 = xla_hlo.add %14, %11 : tensor<?x16xf32>
    %16 = "xla_hlo.dot"(%15, %6) : (tensor<?x16xf32>, tensor<16x16xf32>) -> tensor<?x16xf32>
    %17 = "xla_hlo.add"(%16, %3) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x16xf32>, tensor<16xf32>) -> tensor<?x16xf32>
    %18 = "xla_hlo.broadcast"(%0) {broadcast_sizes = dense<[-1, 16]> : tensor<2xi64>} : (tensor<f32>) -> tensor<?x16xf32>
    %19 = xla_hlo.mul %17, %18 : tensor<?x16xf32>
    %20 = "xla_hlo.tanh"(%19) : (tensor<?x16xf32>) -> tensor<?x16xf32>
    %21 = xla_hlo.mul %20, %18 : tensor<?x16xf32>
    %22 = xla_hlo.add %21, %18 : tensor<?x16xf32>
    %23 = "xla_hlo.dot"(%22, %7) : (tensor<?x16xf32>, tensor<16x10xf32>) -> tensor<?x10xf32>
    %24 = "xla_hlo.add"(%23, %4) {broadcast_dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<10xf32>) -> tensor<?x10xf32>
    %25 = "xla_hlo.broadcast"(%0) {broadcast_sizes = dense<[-1, 10]> : tensor<2xi64>} : (tensor<f32>) -> tensor<?x10xf32>
    %26 = xla_hlo.mul %24, %25 : tensor<?x10xf32>
    %27 = "xla_hlo.tanh"(%26) : (tensor<?x10xf32>) -> tensor<?x10xf32>
    %28 = xla_hlo.mul %27, %25 : tensor<?x10xf32>
    %29 = xla_hlo.add %28, %25 : tensor<?x10xf32>
    %30 = "xla_hlo.reduce"(%29, %1) ( {
    ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):	// no predecessors
      %35 = xla_hlo.max %arg1, %arg2 : tensor<f32>
      "xla_hlo.return"(%35) : (tensor<f32>) -> ()
    }) {dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<f32>) -> tensor<?xf32>
    %31 = "xla_hlo.sub"(%29, %30) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<?xf32>) -> tensor<?x10xf32>
    %32 = "xla_hlo.exp"(%31) : (tensor<?x10xf32>) -> tensor<?x10xf32>
    %33 = "xla_hlo.reduce"(%32, %2) ( {
    ^bb0(%arg1: tensor<f32>, %arg2: tensor<f32>):	// no predecessors
      %35 = xla_hlo.add %arg1, %arg2 : tensor<f32>
      "xla_hlo.return"(%35) : (tensor<f32>) -> ()
    }) {dimensions = dense<1> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<f32>) -> tensor<?xf32>
    %34 = "xla_hlo.div"(%32, %33) {broadcast_dimensions = dense<0> : tensor<1xi64>} : (tensor<?x10xf32>, tensor<?xf32>) -> tensor<?x10xf32>
    return %34 : tensor<?x10xf32>
  }

Could we please start a new tag for this where the WG proposal would be somehow pinned on top?
I think @stellaraccident’s message should be a full topic that deserves its own back and forth under this “tag”.
There are also a bunch of other angles to consider, each of which should be a well-separated post + thread rather than a response to this announcement message.

Thanks for considering!

I would be happy to relocate my message but don’t know how to create a tag. Presumably some kind of tag for the proposed effort?

Thanks, Mehdi for initiating this. Sorry for the late reply, was on vacation last week.

I think it would be great to unify all the efforts under one ML dialect; it would definitely reduce our nGraph effort of defining our own core dialect and porting optimizations. Hopefully, there will be more interest from other framework owners.

Defining a common set of low-level primitive ops along with the data-types sounds like a good starting point. Without getting into technical details, here are few things we would like to see:

  1. A fused op representation that contains its implementation (as a region), and can either be intercepted by the backends and mapped to DNN library kernels or expanded (see the sketch after this list).
  2. Ability to annotate and add custom info to Tensor types (e.g. dynamic dims range)
  3. Primitive control-flow ops in their own separate dialect that can be readily optimized by MLIR passes. Maybe re-use the MLIR Loop dialect ops?
  4. A common set of optimization/analysis passes for the dialect. For example, shape specialization based on graph inputs would be very useful.
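
To make point 1 concrete, here is a rough sketch of what such a fused op could look like. The "tcp.fused_call" / "tcp.yield" op names and the "callee" attribute are made-up placeholders for illustration, not existing ops; only "xla_hlo.tanh" in the body is an op from the examples above:

// Hypothetical fused op carrying its reference implementation as a region.
// A backend that recognizes callee = "tanh" can dispatch to a DNN library
// kernel and ignore the region; any other consumer can simply expand it.
%out = "tcp.fused_call"(%in) ( {
^bb0(%x: tensor<?x16xf32>):
  %y = "xla_hlo.tanh"(%x) : (tensor<?x16xf32>) -> tensor<?x16xf32>
  "tcp.yield"(%y) : (tensor<?x16xf32>) -> ()
}) {callee = "tanh"} : (tensor<?x16xf32>) -> tensor<?x16xf32>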

Thanks,
nagy

With regards to the question of a small set of primitive math-ops mentioned in @stellaraccident 's note above: it seems to me that it is useful to decouple “tensor ops” from “scalar ops”. As the title of this thread indicates, what is more interesting (and likely more controversial) are tensor ops, since they provide many different optimization possibilities. There is some hope, however, that the number of core tensor ops/primitives can be kept small, especially if we have a separate scalar-op dialect. The tensor ops will be higher-order ops that take scalar-functions as parameters.

As an example, consider the expansion of SIGMOID in the above message. It seems to me that it would be simpler to express SIGMOID as the invocation of a single tensor-op (say UnaryElementWiseOp) that takes a scalar-function as parameter (expressed as a region in MLIR) that does the scalar-computation of sigmoid (using scalar ops mul/tanh/mul/add instead of tensor versions of the same ops).
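
Purely as a sketch, that could look something like the following; the "tcp.unary_elementwise" and "tcp.yield" ops are hypothetical placeholders (as is the assumption of a scalar tanh op), but the scalar body is just the sigmoid expansion from the example above:

// Hypothetical higher-order elementwise op: the tensor op only says "apply
// this scalar function to every element", and the region carries the scalar
// sigmoid computation 0.5 * tanh(0.5 * x) + 0.5.
%out = "tcp.unary_elementwise"(%in) ( {
^bb0(%x: f32):
  %half = constant 5.000000e-01 : f32
  %0 = mulf %x, %half : f32
  %1 = tanh %0 : f32   // scalar tanh, assumed available as a scalar op
  %2 = mulf %1, %half : f32
  %3 = addf %2, %half : f32
  "tcp.yield"(%3) : (f32) -> ()
}) : (tensor<?x16xf32>) -> tensor<?x16xf32>

A tensor-level transformation then only needs to understand “unary elementwise”; the scalar body can stay opaque to it.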

E.g., it looks like this is the direction that the StructuredOp in the Linalg dialect is taking (except that it is at the lower level of buffers). Wouldn’t the same kind of approach make sense at the tensor level too?

Absolutely, and this is precisely the direction in which the semantics of linalg.generic and linalg.indexed_generic have evolved in the past few weeks. We realized that having custom ops that can work on both buffers and tensors simplifies a lot of issues and could help with the phase-ordering problem of buffer allocation + layout and other transformations.

Thanks, @g.ramalingam. Interesting idea, and raises a few questions:

Is the intention here that frameworks/front-ends can define an op’s semantics by attaching the scalar-op function to it? And if so, will the ops still have “default” semantics if the scalar function is missing?

If the ops always come with a scalar-function that defines what they do, it is not clear to me how tensor-level optimizations/analyses will understand such ops.

Thanks!

Very interested in this topic, which is also related to quantization. A minimal yet complete set of ops, whose semantics and properties are well known and modelled, would go a long way towards building quantization at “this level”.

I am eager to help the discussions on this forum wrt Quantization.

Absolutely. I wasn’t expecting to get too detailed in this thread, I just wanted to gauge the interest! :slight_smile:

Creating a sub-section was part of the proposal; however, Discourse does not support it yet (they said “early this year”): it supports two levels of nesting right now, and MLIR is under LLVM. I’ll look into what we can do for now.

Hi Mehdi,

Thanks a lot for taking the lead on this. I think this is a very important problem in frontend / IR / compiler codesign that should be tackled by the community.

We hope that existing work on the Linalg dialect (see Design Document for the Linalg Dialect ) can help move the discussion forward and help exhibit some of the tradeoffs involved.

Looking forward to live discussions!

MLIR-TCP-WG may be too much of a mouthful?

This is certainly an awesome effort; thanks Mehdi for seeding the discussion!

I agree with Stella’s point that we should decompose this large problem space into smaller ones to make each one more focused and tractable. They interact with each other though: dynamic shape support will certainly affect how we choose high-level NN/math ops. So it seems to me that apart from defining the principles and dividing the space, we might also want to settle the more fundamental and far-reaching aspects like dynamic shapes first, before discussing detailed ops; otherwise we might keep going back to those fundamental points. For op sets, I can see multiple levels of abstraction that we could create, and some of them already exist in MLIR core (albeit incomplete or perhaps not holistically thought out). It would be nice to repurpose or build upon them.

@nmostafa : good question. Let me clarify what I meant.

I am talking about higher-order ops that always take a scalar-op-function parameter. Without this parameter, they are incomplete.

The idea you mention as point 1 in your other message, about having instructions that contain a call to an op along with a region that defines its semantics (and can be used by an implementation that does not understand the specific op), is orthogonal.

I believe that the key optimizations and transformations can often be done without a dependence on the “scalar computation” component: for example, all unary-element-wise ops can use the same optimizations, all binary-element-wise ops can use the same optimizations, etc. (Well, I mean most of the optimizations we use in practice; there could be some rare optimization that exploits some property of the scalar-computation).

One of the missing features (footnote: we have placed “ML dialect” as a teaser/lure at our open design meetings a few times in the past ;-)) is a set of common ops that multiple different ML frameworks could target, as a layer designed from first principles and suitable as an optimization IR.

Is this really restricted to a “set of common ops” for ML frameworks, or could this be extended to other ops like those in OpenVX/OpenCV?

In an ONNX thread on the MLIR Google group, @stephenneuendorffer briefly mentioned something related about the OpenVX/OpenCV vs ML landscape.

P.S. Just to give an OpenCV G-API reference: OpenCV: Graph API

Will this cover both DNN and traditional ML, or will it start from DNN only?

Thank you Mehdi for the initiative. This has been pending in the TensorFlow compiler world for a while.
A few ideas in my mind to share: I am interested in enabling this in an iterative way, through enhance-and-refine steps. It would be good to have an umbrella initiative with a bunch of smaller components underneath. I am interested in starting this as an extension of the current XLA compiler, since it is fairly mature for certain cases.
We are interested in bringing dynamic shapes & control flow and a few non-algebra ops into the TensorFlow compiler to handle more of the deep learning workload: not only the dense computing part, but also data processing etc.
From an engineering and design point of view, redesigning a completely new IR now may not be the best choice. Why not take a specific scenario, such as supporting embedding ops or dynamic shapes in the TensorFlow compiler, and implement it as a subset, bringing an end-to-end best practice into the code base? Meanwhile, since many folks have their own problems to solve, we can start to enhance the IR in a systematic way. We would also already have something that works as a general compiler, not only for static information.
For now, XLA has a good base implementation for CPU, GPU & TPU, with potential TF & partial PyTorch support in a static way, with limited control-flow support. We can take this as the initial implementation, and I’d prefer to reuse the XLA infrastructure as much as possible. Since MLIR is good at interfacing and bridging too, we may be able to propose a hybrid approach: XLA stays as it is, while MLIR’s TF compiler flow starts to implement the extra logic such as tf.unique, the tf.slice op, control flow, etc., with a complete end-to-end compile flow. MLIR may not have the same op coverage as a starting point, but it can focus on correctness and functionality. For hpcg, we may leave it to the hardware vendors’ APIs at the beginning.

We proposed dynamic shape support in the TensorFlow MLIR group, and I’d like to bring all of our experience to this initiative in a constructive way.
This is very high level. Let’s come up with a flow chart, then discuss each part in detail and finalize a good starting point.

looking forward to the discussion.

I’d like to leave it open to the people interested in actually building this to define the actual scope.
My take is that we really want to have a compiler IR here, designed with transformations/optimizations in mind. I am not confident enough to say how orthogonal (or how similar) the optimizations on the usual set of ops in the tensor domain (like in nGraph/HLO) are compared to what you would want to achieve with image-processing-style primitives from OpenVX/OpenCV.

I work on TensorFlow and I am more familiar with DNNs; can you help clarify what it would encompass? Any pointers to existing frameworks? It may just be better addressed by a different IR with different optimization techniques; we should look into it!