affineDataCopyGenerate and parallel loops

Currently affineDataCopyGenerate only handles sequential loops. Let’s consider a case that involves parallel loops (loads and stores are written in shorthand):

affine.parallel %x = 0 to 32 {
  // other ops...
  affine.for %y = 0 to 10000 {
    affine.for %z = 0 to 7 {
      %v = %buf[%x + %y * 7 + %z]
      // use %v
    }
  }
}

The intention is to create a buffer sized for all iterations of %x and %z, but for a given %y. In this case I want:

%cache = alloc() : memref<38xf32> // [0, 32) + [0, 7) = [0, 38)
affine.parallel %x = 0 to 32 {
  // other ops...
  affine.for %y = 0 to 10000 {
    // slow but semantically equivalent to faster versions.
    affine.for %i = 0 to 38 {
      if (%i mod 32) == %x {
        %cache[%i] = %buf[%y * 7 + %i]
      }
    }
    // this op doesn't exist yet.
    // On Nvidia GPU, %x maps to threads in a single block,
    // and sync is nvvm.barrier0
    sync %x
    affine.for %z = 0 to 7 {
      %v = %cache[%x + %z]
      // use %v
    }
  }
}

It requires two changes of assumptions in the current implementation, given the input region R under analysis (in the case above, loop %z):

  • Index values defined outside of R have a single value over the execution of R. This is no longer the case.
  • alloc() is always placed right before the prologue. Also no longer the case, per the example above.

Proposed change:

  • Add a separate positional input for affineDataCopyGenerate, indicating where to allocate. Let’s say we want to create an AllocOp before region A (in the case above, %x; the running example is annotated right after this list).
  • Assert that A is an ancestor region of R.
  • For each parallel loop that (1) has A as an ancestor and (2) is an ancestor of R, add it to the MemRefRegion analysis.
  • Figure out a way to efficiently copy data over, or delegate it to the users (e.g. require GPU to implement something like dma.{start,wait}).
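
To make the proposed inputs concrete, here is the running example annotated with where A and R sit (just a restatement of the case above):

// A: allocate the fast buffer right before this op.
affine.parallel %x = 0 to 32 {
  affine.for %y = 0 to 10000 {
    // R: the region under analysis.
    affine.for %z = 0 to 7 {
      %v = %buf[%x + %y * 7 + %z]
      // use %v
    }
  }
}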

Any thoughts?

Does the caller have to provide such a location? Or can we always compute this position unambiguously (outside the outermost affine.parallel)?

I think it becomes clearer if we restate this. It looks like you need a buffer for a region ‘for all %x, %z’ but for a given %y? affineDataCopyGenerate is currently designed to be symbolic in all loop IVs from the outermost down to a certain desired depth (and in other top-level symbols), i.e., for a given value of those, and the buffer is then sized based on the data accessed over the entire extent of all inner loop IVs. In your case, on the other hand, you want to pick (in a more general sense) from among your IVs, and there is no longer a clean notion of a region based on the outer IVs being symbols and the inner IVs defining your region.
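
As a minimal sketch of that current behavior (using the same load/store shorthand as above, and pretending %x is a plain affine.for, since the utility only handles sequential loops today): with the copy generated at depth 2, %x and %y are symbolic and the buffer is sized only for the data accessed over the full extent of %z:

affine.for %x = 0 to 32 {
  affine.for %y = 0 to 10000 {
    // buffer sized for all of %z, for a given (%x, %y)
    %cache = alloc() : memref<7xf32>
    // prologue
    affine.for %i = 0 to 7 {
      %cache[%i] = %buf[%x + %y * 7 + %i]
    }
    affine.for %z = 0 to 7 {
      %v = %cache[%z]
      // use %v
    }
    dealloc %cache
  }
}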

Reg. the placement of alloc (also related to @sanjoy_das_google’s question), wouldn’t you want to place your alloc right outside the outermost among the loops that the region isn’t symbolic on, in this case %x? Isn’t that the desired placement by default? Note that the allocs could be hoisted up further if there are surrounding loops, and affineDataCopyGenerate already has logic to hoist the copy placement point further out (both allocs and transfers); an independent utility to do that could also be built out.

If the above is true, then it looks like you need the generalized affineDataCopyGenerate to take a list of IVs that you want for your region - the rest are the ones you are going to be symbolic on? The current implementation would then become a special case of that. The issue of supporting affine.parallel appears orthogonal.

For my use case, the alloc() is PTX shared memory, so it should always be placed outside of parallel ops that represent threads within the same block (%x), not necessarily at the top level.

Sure, but nonetheless the prologue (%i) is placed within some non-symbolic loops (%x). The point is that the alloc ends up significantly far away from the prologue. See the formal statement below.

The current logic never hoists ops out of non-symbolic loops, not to mention that hoisting an AllocOp out of an affine.parallel is non-trivial (unlike hoisting out of affine.for) and requires MemRefRegion analysis.

Without affine.parallel, symbolic loops are always consecutive in the sequence of ancestors (from the load to the root), so there is no need for the proposed change you and I mentioned (“to take a list of IVs that you want for your region”).

Formally, let’s say that X is the ancestor sequence of the load, of size n, starting from the root. In the case above, it’s [%x, %y, %z].

Before the change, there exists a unique i such that X[0 … i-1] are symbolic, while X[i … n-1] are not.

After the change, there exists a pair (i, j) such that:

  • X[0 … i-1] are symbolic {loop, parallel} ops;
  • X[i … j-1] are symbolic loop ops or non-symbolic parallel ops;
  • X[j … n-1] are non-symbolic {loop, parallel} ops.

X[i] would be region A in the original post, and X[j] would be region R.
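
Concretely, for the case above this gives (i, j) = (0, 2): X[0 … i-1] = [] (no leading symbolic ops), X[i … j-1] = [%x, %y] (a non-symbolic parallel op followed by a symbolic loop op), and X[j … n-1] = [%z] (a non-symbolic loop op).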

Right - but it isn’t meaningful to place it anywhere inside the loops that define your region (in this case %x, %z - because the region is computed for all iterations of %x, %z). Anywhere further up should also be fine since the buffers being created are of a ‘constant bounding size’ - so they don’t depend on any outer IVs or top level symbols.

Right. Note that the prologue will have to be placed inside %y here. It can’t normally be hoisted past any of the things that the region is symbolic on unless the transfer itself is invariant on that symbol.

The non-symbolic loops will never appear outside of the alloc/transfer/copy loops; they are all dominated by the prologue.
The hoist check is then simple: if the memref region is invariant on a symbolic loop (i.e., the loop’s IV doesn’t appear in the region’s symbol list), the copy can be hoisted past it.
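
To illustrate with a hypothetical variant of the running example (not the original access pattern - suppose the access were %buf[%x + %z], so the copied region is invariant on %y): the prologue could then be hoisted out of %y:

%cache = alloc() : memref<38xf32>
affine.parallel %x = 0 to 32 {
  // the copied region doesn't involve %y, so the prologue hoists out of %y
  affine.for %i = 0 to 38 {
    if (%i mod 32) == %x {
      %cache[%i] = %buf[%i]
    }
  }
  sync %x
  affine.for %y = 0 to 10000 {
    affine.for %z = 0 to 7 {
      %v = %cache[%x + %z]
      // use %v
    }
  }
}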

I see - you are saying this use case makes sense only for affine.parallel’s and not when you just have affine.for’s. That’s perhaps true, but let’s say in the future affine.for had a parallel attribute (we probably won’t have one and will instead reuse 1-d affine.parallel); what you propose here would be useful in that context as well.

This is a nice way to frame things and come up with the input spec - let’s use your terminology. In what you state above, you are allowing affine.for ops that are to be treated symbolically for the region to potentially appear after ones that shouldn’t be (in the X[j…n-1] part). But that isn’t really a meaningful input, per your comments above and otherwise. Could we explore simplifying/generalizing this? I think there is a more intuitive input spec (which will also simplify the implementation), and it’ll also meet your use case. How about the following (using your terminology above)?

You have X[0…n-1]: ignoring all affine.parallel ops in this sequence, you have a sequence of 0 or more leading loops (affine.for) which are to be symbolic for the region, followed by the remaining affine.for’s which are region dimensional (non-symbolic). The affine.parallel ones could appear anywhere in that sequence, and they could be specified to be either symbolic or non-symbolic. Examples below; one allowed nesting is sketched after the list.
(L = affine.for, P = affine.parallel, S = symbolic, D = non-symbolic)

Allowed: [LS0, LS1, PS0, LS2, PS1, PD1, LD]
Allowed: [PD0, LS1, PS0, LS2, PD1, LD]
Allowed: [LS0, LS1, LD0, LD1]
Allowed: [LS0, PD0, LS1, LD0, LD1]
Disallowed: [LD0, LS0]
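
For instance, the nest shape corresponding to the allowed sequence [LS0, PD0, LS1, LD0, LD1] would be (schematically):

affine.for %s0 = ... {           // LS0: symbolic
  affine.parallel %p0 = ... {    // PD0: non-symbolic
    affine.for %s1 = ... {       // LS1: symbolic
      affine.for %d0 = ... {     // LD0: non-symbolic
        affine.for %d1 = ... {   // LD1: non-symbolic
          // load/store under analysis: the buffer would be sized for all
          // iterations of %p0, %d0, %d1, for a given value of (%s0, %s1)
        }
      }
    }
  }
}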

On another note, I have a couple of patches to affine-data-copy-generate (one to clean up things/test cases, and one to fix/support min/max loop bounds) that I will post for review shortly. If you plan to build out something, you could use that as the base once they’ve landed.

I made a mistake in the original comments, and now it’s edited as follows:

  • X[0 … i-1] are symbolic {loop, parallel} ops;
  • X[i … j-1] are symbolic loop ops or non-symbolic parallel ops;
  • X[j … n-1] are non-symbolic {loop, parallel} ops.

I believe that my intention is more constrained than your proposal, but I made the typo. For example, I wouldn’t allow [PD0, LS1, PS0, LS2, PD1, LD], as it’s hard to implement and not useful - the prologue would need to predicate on PS0 == 0 or something (since PS0 is symbolic, it has to have a single value).

Let me try to rephrase your version into my constrained proposal (a skeleton sketch follows the list):

  1. For all affine.for ops in X[0 … n-1], there are 0 or more leading symbolic ones followed by non-symbolic ones.
  2. For all affine.parallel ops in X[0 … n-1], there are 0 or more leading symbolic ones followed by non-symbolic ones.
  3. The prologue must be placed between the last symbolic affine.for and the first non-symbolic affine.for.
  4. The prologue doesn’t have to dominate all non-symbolic affine.parallel ops. However, a sync is required after the prologue.
  5. The AllocOp must be placed between the last symbolic affine.parallel and the first non-symbolic affine.parallel.
  6. The AllocOp must dominate the prologue.
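
A skeleton of the ordering these rules force, using the running example (same shorthand as before):

%cache = alloc() : memref<38xf32>   // after the last symbolic affine.parallel (none here)
affine.parallel %x = 0 to 32 {      // first non-symbolic affine.parallel
  affine.for %y = 0 to 10000 {      // last symbolic affine.for
    // prologue: after the last symbolic affine.for, before the first non-symbolic affine.for
    affine.for %i = 0 to 38 {
      if (%i mod 32) == %x {
        %cache[%i] = %buf[%y * 7 + %i]
      }
    }
    sync %x                         // required: the prologue sits inside the non-symbolic %x
    affine.for %z = 0 to 7 {        // first non-symbolic affine.for (region R)
      %v = %cache[%x + %z]
      // use %v
    }
  }
}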

This makes a lot of sense to me - the earlier typo had completely changed the spec. So basically you have a middle part where symbolic loop ops and non-symbolic parallel ones are potentially interleaved. A few comments on your list above.

  1. Reg. the 5th bullet, shouldn’t the alloc op actually be placed before any non-symbolic op (whether affine.for or affine.parallel)?

  2. The fourth bullet is the part that I don’t understand yet - this is key. Wouldn’t that lead to every thread getting the same thing repeatedly (for the number of iterations it executes)? What’s the difference b/w placing it just above the first non-symbolic op and below some non-symbolic affine.parallel op?

True, and I think it’s implied by transitivity from (3), (5), and (6), so I didn’t spell it out. These rules force the following order in X: AllocOp → first non-symbolic affine.parallel (or none) → prologue → first non-symbolic affine.for.

It would. Hence in the original example I have to use if (%i mod 32) == %x to avoid that. In practice, the prologue had better be sharded across all threads to get the performance benefits (load coalescing, PTX block-level cache, etc.), and also to avoid non-atomic stores to the same memory location. I’d very much prefer users to lower an opaque data copy op (similar to or the same as dma.{start,wait}) to handle these details.

If we move the prologue up above some non-symbolic parallel op, it may also move across some symbolic affine.for op. That would invalidate the program, since the prologue uses all the symbolic IVs (in the original case, %y).
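
For instance, in the running example, hoisting the prologue out of %y (on the way to moving it above %x) would leave it referencing %y out of scope:

// Invalid: the prologue is hoisted above %y but still uses %y.
affine.parallel %x = 0 to 32 {
  affine.for %i = 0 to 38 {
    if (%i mod 32) == %x {
      %cache[%i] = %buf[%y * 7 + %i]   // error: %y is not defined here
    }
  }
  sync %x
  affine.for %y = 0 to 10000 {
    affine.for %z = 0 to 7 {
      %v = %cache[%x + %z]
    }
  }
}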

Overall, all of this sounds good to me. A couple of comments below though.

Although this seems to be too target-specific, it sounds fine for this placement flexibility to be provided by the utility (as far as where the prologue goes). But how are those conditional guards derived, and how is their generation specified in a target-independent way?

Another comment reg. hoisting out of affine.parallel that you mentioned earlier:

Hoisting out of affine.parallel doesn’t require any memref region analysis (if one just wanted to extend the hoisting currently implemented). You’d just have to check whether the region arguments of the affine.parallel appear in the MemRefRegion, just like you check whether an affine.for’s IV appears among the region’s symbols (for a symbolic affine.for). You will, however, need control / a flag to disable the automatic hoisting if a client wants the copy placed at the specified position.

How do you plan to split implementation for this support? It looks like there are several non-trivial components. I’d also recommend splitting out / refactoring affineDataCopyGenerate first – it has become really long. And I am happy to help with this.

Not easily. That’s why I strongly prefer them to be lowered later by target-specific passes. affineDataCopyGenerate would only generate dma-like ops.

I was talking about a different kind of hoisting. In general, in a function with affine.for ops only, one is free to hoist an AllocOp (but not the prologue!) all the way to the top level without changing the shape of the allocated buffer:

affine.for %x = ... {
  affine.for %y = ... {
    %buf = alloc() : memref<2x3xf32>
    // use %buf
    dealloc %buf
  }
}
=>
%buf = alloc() : memref<2x3xf32>
affine.for %x = ... {
  affine.for %y = ... {
    // use %buf
  }
}
dealloc %buf

…because the different instances of %buf in the body don’t have overlapping liveness. However, it’d be wrong if some of the loops were affine.parallel:

affine.parallel %x = 0 to 20 {
  affine.parallel %y = 0 to 30 {
    %buf = alloc() : memref<2x3xf32>
    // use %buf[...]
    dealloc %buf
  }
}
=>
%buf = alloc() : memref<20x30x2x3xf32>  // memref<2x3xf32> is wrong
affine.parallel %x = 0 to 20 {
  affine.parallel %y = 0 to 30 {
    // use %buf[%x, %y, ...]
  }
}
dealloc %buf

…since the body of an affine.parallel can be executed in parallel, semantically speaking the alloc() in the body can also allocate memory in parallel (one alloc() call per thread). Hoisting such an alloc() out of an affine.parallel needs to change the shape of the alloc() to start with, and clever tricks may follow to reuse some of the overlapping memory (like the one exhibited in the original post).

That will be GREAT!

I already contributed generateCopyForMemRegion, so that’s a good start. :slight_smile: As for the rest, here is what I speculate:

  • For MemRefRegion, we know that we want to support two kinds of loopDepth for compute(), one for affine.for and the other for affine.parallel. Hopefully FlatAffineConstraints computes the correct bounding box automatically.
  • generateCopyForMemRegion needs to be more flexible in terms of AllocOp / DeallocOp / prologue / epilogue placement, which is easy to achieve.
  • generateCopyForMemRegion also needs to rewrite all AffineMaps correctly, w.r.t. the prologue / epilogue and the original load / store. This is a complicated topic, and it can take up a full separate post (which I will create).
  • The prologue and epilogue have to be a pair of abstract operations. Their semantics are quite clear - copy the whole data over using the parallelism provided by non-symbolic affine.parallel ops. The implementation is highly platform-dependent, considering the lack of basic sync operations in the standard dialect. I don’t know much about dma.{start,wait}; maybe we can reuse them, or tweak them, or invent new ops. I’ll take a look at this.

dma.start/wait aren’t connected to parallelism - they correspond to the issue and finish respectively of a non-blocking transfer. I think it’s worthwhile to see how you would like to model the data transfers: is your goal to just use/generate the point-wise copy to start with (just loops around load/store)? Could your transfer itself be modeled using affine.parallel’s (with load/store’s under them)? It would also be interesting to see if a new op that is the data transfer equivalent of affine.parallel would be useful, i.e., it’d be inherently multidimensional and hyperrectangular like the dma.start/wait/affine.parallel ops.
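
For example, one possible shape for such a transfer, if it were written as its own affine.parallel over the copied region (using the shorthand from earlier; how this then maps onto the thread-level affine.parallel %x is exactly the open question):

affine.parallel %i = 0 to 38 {
  %cache[%i] = %buf[%y * 7 + %i]
}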

Btw, the first cleanup patch landed, and the second one is here D77320. Once this one is in, we could refactor it with no functional change.