Linalg.generic for multiple, different reduction dimensions

Hi all,

we wonder if linalg.generic can be used to express computations based on multiple, different reduction dimensions, e.g. the MBBS example in this PLDI’19 paper. The (simplified) MBBS code in C is:

// init
for( int i_1 = 1 ; i_1 < N_1 ; ++i_1 )
  out[ i_1 ] = 0;

// case: i_1 == 0
for( int i_2 = 0 ; i_2 < N_2 ; ++i_2 )
  out[ 0 ] += inp[ 0 ][ i_2 ];

// case: i_1 > 0
for( int i_1 = 1 ; i_1 < N_1 ; ++i_1 )
{
  for( int i_2 = 0 ; i_2 < N_2 ; ++i_2 )
    out[ i_1 ] += inp[ i_1 ][ i_2 ];

  out[ i_1 ] += out[ i_1-1 ];
}

Here the intermediate results of the i_1 loop are combined by prefix-sum (also known as scan) and the results of the i_2 loop are combined by addition “+”. Can this example be expressed in linalg.generic? If so, how? We noticed the following comment in the MLIR code here, but for the combine operators prefix-sum and addition correctness should not be violated.

Thanks in advance for any help!

As a rule of thumb, a single linalg.generic models a perfectly nested computation.
While you may technically be able to express this as a single perfectly nested loop nest with proper conditionals, in practice we have found this approach to be counter-productive.

In this case, I see 2 possibilities forward:

  • write this as (affine) loops and implement the proper analyses and transformations to get the output you want
  • introduce a scan / segmented scan op on tensor/buffer and plug it into the system. Note that we have such an op at the vector level: VectorOps.td - llvm/llvm-project - Sourcegraph.

If you go with the second option, you can then rewrite this using the tensor comprehensions notation (assuming inplace buffers):

out(i) = 0;         // fill
out(i) += inp(i, j) // reduce
scan_op(out)        // scan 

The reasons you may want such a scan op at the tensor/buffer level are:

  • it can lower unsurprisingly to the vector abstraction for which we have (or at least expect to have, I haven’t dug yet) unsurprisingly good retargetable lowerings.
  • its semantic carry the information that the iteration domain can be logarithmic, which is not trivial to recover from the loop domain.

More generally, I think there is design space for another type of “generic” op with a region and logarithmic iteration spaces but my intuition is it does not compose with reductions (or at least as we currently represent them, I have not dug into this topic deeply enough).

Note that a pragmatic choice here—rather than designing the next level of generalization along this axis—is to just use ops with the proper interfaces that connect to the vector level, see e.g. IREE’s LinalgExtOps.td - iree-org/iree - Sourcegraph.

Deriving implementations that are more or less fused/parallel/vectorized can then be done with transformations.

The above is for the “representing” scan part as a single linalg.generic vs multiple ops. The TL;DR is “use multiple ops”.

Separately, there is the question of having an aggregate op, e.g. in your case a single xxx.mbbs op in some dialect. This is very similar to @harsh-nod 's great work on Winograd convolutions (in his presentation he also goes through the tradeoffs of implementing this as a single linalg.generic).
In both cases, the outcome seem to be that we need a first class “sequence” or “dag” construct that we can use to build new ops and have transformations like tiling, fusion etc carry to that.
My intuition is there isn’t something generic to represent all cases, but making the process simpler and more automated will be very useful.