Is it possible to use Linalg::GenericOp (Linalg::IndexedGenericOp) to express a computation having multiple shape-inconsistent outputs?

According to my understanding, it currently seems impossible to use Linalg::GenericOp (Linalg::IndexedGenericOp) to express a computation with multiple shape-inconsistent outputs.

[figure: a computation with two outputs, where the shape of the first output (add) is larger than the shape of the second output (exp)]

For example, as shown in the above picture, we have a computation with two outputs. The shape of the first output (add) is larger than the shape of the second output (exp). It is impossible to express this with Linalg::GenericOp because linalg.yield currently always returns the same number of valid values, which are then translated into store instructions.

The above limitation could be lifted if the type of the operands/results of linalg.yield supported something like an optional type. For example, the above computation could be expressed as in the following pseudocode.

// A computation with two outputs.
// The shape of the first output is larger than the shape of the second output.
linalg.indexed_generic(...) {
    %x = ...        // compute an element of the first output
    %cond = ...     // compute a condition based on the indices
    if (%cond) {
        %y = ...    // compute an element of the second output
        linalg.yield %x, %y
    } else {
        // Optional yield: no store instruction is needed for the second output.
        linalg.yield %x, None
    }
}

My questions are:

  1. Does linalg::GenericOp support computations having multiple shape-inconsistent outputs?
  2. If not, is there any plan to support it? What about the proposed solution above?

Thanks!

No.

It could be an interesting extension. However, Linalg is driven by transformations rather than expressiveness, as described in the rationale. Linalg explicitly discourages anything that would require non-trivial analysis of the computation, and your sketch does require that. If you can figure out how to describe such outputs in the attributes of linalg.generic, using affine maps, and how to understand that statically at compile time, that could be a step towards supporting what you want. In particular, how do we know which executions of the body should not produce values? Do we still perform the computations in those executions, or do we guarantee some sort of DCE (reading values may have side effects even if you don't write)?
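To make this point concrete, here is a purely hypothetical sketch (the `valid_ranges` attribute does not exist in Linalg; the name, the set values, and the abbreviated syntax are all illustrative) of how the valid region of a smaller output could be encoded statically as an integer set over the loop indices, so that it can be read off the attributes instead of being recovered by analyzing the body:

```mlir
// Hypothetical: output 1 is only defined where i < 4 and j < 4.
#all   = affine_set<(i, j) : (1 >= 0)>
#small = affine_set<(i, j) : (3 - i >= 0, 3 - j >= 0)>
linalg.indexed_generic {
    indexing_maps = [...],
    iterator_types = ["parallel", "parallel"],
    // One validity set per output, analyzable without looking at the body.
    valid_ranges = [#all, #small]}
  ...
```

The point of such an encoding is that transformations (tiling, fusion) could intersect these sets with tile footprints directly, in the spirit of the existing `indexing_maps` attribute.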

Also ping @nicolasvasilache.

As @ftynse mentioned, it is not currently possible.

Extensions to Linalg are most welcome, as long as they are compatible with the design principles in the rationale.

For your use case, one simplistic answer would be to represent your graph with two linalg ops; fusion will then create enclosing common loops with linalg ops inside, giving you a more natural representation of imperfectly nested control flow. Trying to cram everything into a single perfectly nested region is certainly doable, but is it the right thing to do? In general it is not, but YMMV.
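A hedged sketch of that two-op form (the shapes, the `#pointwise_2d` trait name, and `%add_sub` are made-up placeholders, and the generic-op syntax is abbreviated): each op has shape-consistent outputs on its own, and fusion later creates the common enclosing loops:

```mlir
// First op: the larger, element-wise add.
%add = linalg.generic #pointwise_2d %a, %b {
^bb0(%x: f32, %y: f32):
  %s = addf %x, %y : f32
  linalg.yield %s : f32
} : tensor<16x16xf32>, tensor<16x16xf32> -> tensor<16x16xf32>

// Second op: exp over the smaller region of %add it needs
// (%add_sub: extraction of that sub-region elided), with its own,
// smaller and internally consistent iteration space.
%exp = linalg.generic #pointwise_2d %add_sub {
^bb0(%x: f32):
  %e = exp %x : f32
  linalg.yield %e : f32
} : tensor<4x4xf32> -> tensor<4x4xf32>
```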

In terms of other extensions we are pondering:

  1. We could let the region take views and allow nested linalg ops. While this is possible and would augment the expressiveness with some form of structured nesting, I have not yet seen compelling reasons to do it: so far, perfect nesting plus fusion, with loops expressing imperfect nesting as the result of transformations, seems quite powerful (see the closedness-under-transformations observation).

  2. One could think of the current implementation of the linalg -> loops conversion as one level of indirection through memory (i.e. inserting loads and stores). Gather-scatter semantics could correspond to two levels of indirection through memory. Zero levels could make sense too, to get load/store forwarding by construction. There may be a way to achieve what you want with a different level of indirection (e.g. gather-scatter), which could also serve other purposes (but this is still a very preliminary idea).

  3. Potentially the solution you propose, but I have not worked through its implications yet.

In particular, Alex's point is again spot on: we want to encode all the information in the type system, e.g. with affine maps. We could also use something other than affine maps (and we will in the future, for more dynamic data types), but the key is to avoid the inverse problem: matching complex structures in the IR and performing difficult analyses to recover information that we did not need to lose in the first place.

The design space is pretty large, and it is easy to go in a direction that will be hard to backtrack from, so we are voluntarily conservative: the implications of extensions must be well understood on all existing transformations.

Now that I have unpacked some of the discussion, I would ask why you are interested in this particular design.
Should I extrapolate from your partial example that you are interested in forcing fusion in the tensor world by turning imperfectly nested control flow into perfectly nested control flow plus no-ops?
There is definitely value in that re. buffer allocation and memory guarantees, and maybe it is a case for supporting non-load/store accesses.

If you are only interested in the buffer world, I would suggest relying on fusion: the optional yield looks like an optimization that would fit in that scenario.

Could you expand what you would like to do a little more please?
I am also happy to chat over VC: I find the communication bandwidth of text to be extremely low for this type of deeper design discussion.

Meta: yes, but writing text forces you to think through your arguments carefully.

Thanks for all your answers.

Yes. This idea originated from the work supporting the kLoop fusion template of XLA in the tensor world. Buffer-based fusion can be more powerful, but it is also more verbose, especially for the simple kLoop fusion case (e.g. we need explicit shape inference and buffer allocation before we reach the buffer world). Thus we may prefer tensor-based fusion in the simple case.

An extension like the above indeed introduces some complexity for transformations. After reading the answer about “what and how” here, I think loop.parallel may be more suitable for this case.

Btw, the linalg dialect now supports both tensor-based fusion and buffer-based fusion. Is there any suggestion on how to combine the tensor world and the buffer world smoothly?

According to my understanding, what tensor-based fusion does is just combine multiple generic ops into one. Given the expressiveness of linalg.generic, it seems that in most cases we could not fuse the whole pattern in the tensor world, and would end up doing partial fusion in the tensor world and then lowering to the buffer world for the rest of the fusion?

Great, tensor-based fusion definitely makes sense. For now this is still a very underexplored area, as linalg.generic only recently started allowing tensors. Multiple pieces are not yet connected, but we are working on it, in particular with collaborators also involved in XLA (@herhut, @pifon2a) and in IREE (@MaheshRavishankar, @asaadaldien).

I think this is the current workaround that @herhut and @pifon2a are using. In particular, some amount of fusion is done on loop.parallel. Note however that loop.parallel seems to imply the buffer world, which would seem to defeat your purpose. There may be workarounds now that loops can yield values; one could yield a tensor. I am not yet familiar enough with these aspects.

However I think there may be a better way, see at the end of the message.

Not yet, we are still iterating on connecting all the pieces. The fact that linalg.generic can verify operations with mixed buffer and tensor operands should help. Pieces are still missing, in particular a naive buffer allocation scheme for Linalg to allow end-to-end compilation of, say, MNIST on tensors as a first step. @pifon would know best, but I think collaborators from DFKI are interested in this.

Once we have an end-to-end functional path we can experiment, learn and give some guidance. But I think it is too early atm, sorry …

OTOH this also means there are ample opportunities for contributing back and influencing the toolchain :wink:

Your description of the current flow is accurate: simple pointwise stuff can be fused directly at the tensor level, and it removes some phase-ordering issues related to:

  1. mark ops to be fused
  2. buffer allocate
  3. fuse on loops + buffer
  4. kill temporary allocations

For now reductions do not fit in this scheme and are indeed handled separately.

Another potential extension I have been thinking about, which could help, is allowing filters for each operand. This could make sense because, atm, linalg.generic can be viewed as a multi-dimensional "map-reduce-apply" with a generic apply function. Making it "map-reduce-filter-apply" would not seem like a major challenge (it would make things more verbose but should play nicely with transformations). There will be more downstream consequences regarding reliance on other canonicalizations, DCE, conditional hoisting, and strength reduction, but it feels like a relatively natural extension.

In the general case, such filters could be relatively general functions applied to the loads/stores and maybe to the production of the values in the body. In other words, filters would be a potential way of implementing your None type that would still be encoded in the attributes of linalg for easy retrieval and composition with transformations. This should be compatible with the design principles of Linalg.

Such filters could take different forms, either:

  1. integer sets, or
  2. more general functions that would take loop IVs as arguments (same as linalg.indexed_generic) and be inlined when lowering to loops and other things, or
  3. something more structured than 2. but more expressive than 1.

Personally, unless there is a strong preference for 1., I would start with 2., because it brings us a step closer to gather-scatter, vector-predication intrinsics and, later, non-dense types.
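A purely hypothetical sketch of option 2 (neither the `filters` attribute nor the `@in_second_output` symbol exists in Linalg; names and the abbreviated generic-op syntax are illustrative): each output gets a filter function over the loop IVs that guards whether its value is computed and stored, and which can be inlined when lowering to loops:

```mlir
// Hypothetical "map-reduce-filter-apply" form: the second output is
// only produced where its filter returns true.
linalg.indexed_generic {
    ...,
    filters = [@always_true, @in_second_output]}
  %A, %Out0, %Out1 { ... }

// Filter for the second (smaller) output, taking the loop IVs,
// same as the index arguments of linalg.indexed_generic.
func @in_second_output(%i: index, %j: index) -> i1 {
  %c4 = constant 4 : index
  %a = cmpi "slt", %i, %c4 : index
  %b = cmpi "slt", %j, %c4 : index
  %r = and %a, %b : i1
  return %r : i1
}
```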

Anyway, the bottom line is: there isn’t something that does what you want today, but your None idea, implemented as an extra filters attribute on linalg generic ops, could be a very nice extension :slight_smile:. It should also be quite simple to implement, but we will need naive buffer allocation in Linalg to test things end to end.

@nicolasvasilache I don’t think Alexander Belyaev is on discourse (@pifon @pifon2a) :slight_smile:

Yes, I pinged him privately internally, he was travelling back from the US and will likely catch up tomorrow :slight_smile:

@bhack I am on discourse now!

And with the same username reserved by the mention :slight_smile:

There is a PR by DFKI collaborators (@dfki-mako, @dfki-ehna) that implements buffer assignment using liveness analysis, etc.:
https://github.com/tensorflow/tensorflow/pull/37212/. It will be used during the HLO->LHLO conversion. Moreover, this version of buffer assignment is planned to be made dialect-agnostic later, so that we can reuse the code elsewhere, e.g. buffer assignment for linalg ops, gradually lowering the mixed tensor-memref form.

@pifon2a @dfki-mako

Thanks for the information.

I had a quick look at this PR. It seems that it does not consider the dynamic shape scenario.

For example, in the following snippet we do a unique-like operation in the true branch. According to my understanding, the allocation point of %2 would be placed in ^bb0. However, the operands of that alloc depend on the shape of the result of SomeUniqueLikeOp, which is not accessible at that point.

Is that right?

func @SomeUniqueLikeOp(tensor<?x?xf32>) -> tensor<?x?xf32>
func @test(%arg0: i1, %arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
  cond_br %arg0, ^bb1(%arg1 : tensor<?x?xf32>), ^bb2(%arg1 : tensor<?x?xf32>)
^bb1(%0: tensor<?x?xf32>):  // pred: ^bb0
  %1 = call @SomeUniqueLikeOp(%arg1) : (tensor<?x?xf32>) -> tensor<?x?xf32>
  %2 = absf %1 : tensor<?x?xf32>
  br ^bb2(%2 : tensor<?x?xf32>)
^bb2(%3: tensor<?x?xf32>):  // 2 preds: ^bb0, ^bb1
  return %3 : tensor<?x?xf32>
}

This is currently being worked on (@dfki-mako knows the details). The general idea is to have a first pass that inserts allocations directly where the operations are, and then optimization passes that try to move them earlier. The plan is to have that version ready early next week.

As already mentioned by @herhut and @pifon2a, we are currently working on this task.
We have updated our initial PR to support a new general BufferAssignment design, which we have already discussed internally with @herhut:
Some parts need to be considered for legalization, and some parts of the general BufferAssignment task should be done in a separate pass.
For example, signature conversion and adjustments of the actual operations (like adding an additional output buffer operand) are highly dialect dependent and should be considered in the legalization or another legalization pass.
However, other parts, such as the optimized placement of alloc and dealloc nodes based on the live ranges of buffers, can be performed automatically.
Our current plan is to extend this initial functionality iteratively with some follow-up PRs:

  • Generic and reusable version of the signature conversion in the class BufferAssignmentLegalizer.
  • Integration of the BufferAssignment pass into the legalization phases HLO->LHLO and HLO->Linalg.
  • Support for dynamically shaped types in the internal BufferAssignment pass.
  • Support for loops in the internal BufferAssignment pass.
  • A common interface for specifying details of possible aliases in relation to operands.

To answer your initial question: yes, there is currently no support for dynamically shaped types. This will change in the near future :slight_smile:

I am intrigued by the map-reduce-filter-apply proposal. It would be really useful to see some more concrete details of what that would look like in Linalg. For some context on my interest here (coming from IREE's use case): in the IREE codegen pipeline (for GPU), this is the path we are following:

  • Convert XLA-HLO to Linalg on tensors
  • Fuse linalg on tensors
  • Convert to linalg on buffers
  • Tile to get loops that can be distributed to workgroups and workitems (i.e blocks and threads)
  • Lower to SPIR-V

In short order we will add fusion using linalg on buffers as well. We are realizing that some operations make sense to fuse on tensors and some on buffers.

I see that there is a slight difference in how linalg on buffers and linalg on tensors work. In the buffer world, the iteration space of the computation is computed from all the arguments (inputs and outputs). For example, for matmul the iteration space is MxNxK, while the data spaces are MxK, KxN, and MxN. In the tensor world, though, it seems more natural that the iteration space is determined only by the output tensors (at least for the ones converted from HLO to linalg on tensors in IREE). So it makes sense to use linalg on tensors for element-wise operations. When it comes to expressing matmul, or really any linalg operation that has a “reduction” iterator type, I am not sure linalg on tensors captures the right semantics. A similar thing seems to be happening in the example from @wyzero in the first post: there are multiple outputs with different “volumes”, so determining the iteration space from them (which would be the hull of all the outputs) means some part of the iteration space will not contribute to the data space of some outputs.
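For reference, matmul written as a linalg.generic makes this explicit (the generic-op syntax here is abbreviated; the indexing maps and iterator types are the real ones): the reduction dimension k appears in the indexing maps of the inputs but not in that of the output, so the iteration space cannot be derived from the output alone:

```mlir
#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>
// The iteration space is (m, n, k); the output map #mapC drops k.
linalg.generic {
    indexing_maps = [#mapA, #mapB, #mapC],
    iterator_types = ["parallel", "parallel", "reduction"]}
  %A, %B, %C {
^bb0(%a: f32, %b: f32, %c: f32):
  %p = mulf %a, %b : f32
  %s = addf %c, %p : f32
  linalg.yield %s : f32
} : memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>
```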

So the question is: how do you filter off parts of the computation within the region of a linalg.generic op in a way that allows writing to outputs of different sizes, while still allowing transformations like tiling to apply naturally?