Placement representation in tensor world & buffer world

We need a placement representation in the IR in the design of DISC. I guess this would be a common issue for other host-device joint compiler stacks. A PlacerPass in the tensor world is usually more feasible for a number of reasons, so we need the placement representation in both the tensor world and the buffer world.

For tensor world:

Our current proposal is to add a custom attribute on mhlo ops. An example:

%209 = "xla_hlo.d_reshape"(%arg1, %208) {xla_dhlo.device = "cpu"} : (tensor<?xi32>, tensor<2xi32>) -> tensor<?x1xi32>

Another example, for a multi-output op:

%5387 = "xla_hlo.d_topk"(%5384, %5386, %73) {dimension = 1 : i64, xla_dhlo.device = ["gpu", "gpu"]} : (tensor<?x22605xf32>, tensor<?x22605xi32>, tensor) -> tuple<tensor<?x6xf32>, tensor<?x6xi32>>

This works fine in our current codebase. However, there's a risk that other mhlo passes may not handle a custom attribute properly (for example, mistakenly dropping it), or that replaced ops might not correctly inherit the placement attribute during some mhlo optimizations. I can think of two solutions at the moment:

1, Add an "official" attribute in the mhlo dialect, and hope that all mhlo-layer optimization passes properly propagate it.

2, Add a memory space property in TensorType.

I personally prefer 1, since it's not intuitive for a tensor to have a memory space. But I still have one concern with solution 1: how can we actually make an attribute "official"?

Please let me know if you have any better ideas.

For buffer world:

It should be OK for us to just use the MemorySpace attribute in MemRefType. But there are a few points in my understanding that I'd like to confirm:

1, the 'MemorySpace' may contain two levels of information: the memory hierarchy (alloc vs alloca) and the memory type (host0/host1/device0/device1/device2). How to interpret it is user-defined.

2, the 'MemorySpace' of MemRefType will not be strictly associated with the 'address space' of LLVM.
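For what it's worth, here is a sketch of both readings in IR. The integer memory spaces follow the classic convention, and since the memory space of MemRefType is now an arbitrary Attribute, a dialect-specific attribute could encode richer information; the "#disc.memory" attribute below is made up for illustration:

```mlir
// Classic integer memory spaces; the meaning of 0/1 is user-defined,
// e.g. 0 = host memory, 1 = device global memory.
%h = memref.alloc() : memref<16xf32, 0>
%d = memref.alloc() : memref<16xf32, 1>

// With an attribute-typed memory space, a dialect-specific attribute
// could encode both the hierarchy level and the concrete memory.
// "#disc.memory" is a hypothetical attribute, not an existing one.
%d0 = memref.alloc() : memref<16xf32, #disc.memory<"device0">>
```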

Does it make sense to create a dialect with a host_tensor and device_tensor type and use these types to represent a placed MHLO graph? This dialect should also include the H2D and D2H HLOs.

The MHLO dialect can then depend on this dialect and MHLOs can be generalized to allow host_tensor and device_tensor as input/output.
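If I understand the proposal, a placed graph might then look something like the following sketch. The "plc" dialect, its types, and its transfer ops are all made-up names for illustration:

```mlir
// Hypothetical "plc" dialect: placement is carried by the tensor types,
// and host/device transfers are explicit H2D/D2H ops.
func @f(%arg0: !plc.host_tensor<?xi32>) -> !plc.host_tensor<?xi32> {
  %0 = "plc.h2d"(%arg0) : (!plc.host_tensor<?xi32>) -> !plc.device_tensor<?xi32>
  %1 = "mhlo.abs"(%0) : (!plc.device_tensor<?xi32>) -> !plc.device_tensor<?xi32>
  %2 = "plc.d2h"(%1) : (!plc.device_tensor<?xi32>) -> !plc.host_tensor<?xi32>
  return %2 : !plc.host_tensor<?xi32>
}
```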

I would rather add a new property in TensorType than create two new host_tensor/device_tensor types. New types mean we need a type conversion somewhere, which in my understanding is far more complicated.

This all sounds very specific to certain kinds of implementations, and likely calls for dialect-specific mechanisms tailored to what is being done.

As a fly on the wall: types are very heavy for such things and make the kinds of transformations you will seek to do tricky (e.g. try to place on a device, fall back to host, etc.). Have you considered an op-centric approach for signaling or outlining transitions?
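For context, an op-centric approach could mean making the transitions themselves explicit ops rather than encoding placement in the types; a hypothetical sketch, with made-up op names in a "placement" dialect:

```mlir
// Transitions are signaled by explicit transfer ops on ordinary tensors;
// everything between a "to_device" and a "to_host" is understood to run
// on the device, so a pass can re-place ops (e.g. fall back to host) by
// simply moving or erasing the transfer ops, with no type conversion.
%0 = "placement.to_device"(%arg0) : (tensor<?xi32>) -> tensor<?xi32>
%1 = "mhlo.abs"(%0) : (tensor<?xi32>) -> tensor<?xi32>
%2 = "placement.to_host"(%1) : (tensor<?xi32>) -> tensor<?xi32>
```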

> Have you considered an op-centric approach for signaling or outlining transitions?

Not sure if I get the point; would you explain in a little more detail?

The “MemorySpace” is fairly opaque at the moment, you can use it to model what you want. It’ll likely have to be specific to a target system I think.

Correct, you can use it to model what you want in MLIR. This is why it evolved to just an Attribute recently.

We could do that, but we'd need to think about what it models exactly. For example, would it model the multiple cores of a device? XLA can do this kind of thing right now, but can't model the host computation, I believe.

We recently added an "encoding" attribute on the TensorType which can support this kind of thing as well.
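For reference, the encoding is an extra attribute slot trailing the element type of a RankedTensorType, so placement could be expressed without introducing new types; the "#placement.device" attribute here is hypothetical:

```mlir
// The encoding is the trailing attribute in the tensor type.
// "#placement.device" is a made-up attribute for illustration; a real
// implementation would need passes to propagate it through rewrites.
%0 = "mhlo.abs"(%arg0) : (tensor<?xi32, #placement.device<"gpu">>)
                       -> tensor<?xi32, #placement.device<"gpu">>
```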

Have you looked at the TF dialects and the device launch op? This has the added benefit that optimizations cannot accidentally add additional transfers.

As the code for TPU is not fully open source, I need to confirm my understanding of tf_device.LaunchOp first: what would you put into the body region of tf_device.LaunchOp?
1, one op, or a kernel;
2, a subgraph?
If 2, we've considered similar approaches, i.e. putting the device subgraph into a region of some kind of wrapper op. However, this will bring additional complexity in order to support UniqueOp under dynamic shape semantics in the future: the graph has to be separated into multiple device::LaunchOps, and there has to be a pass to do the clustering correctly.
If 1, I don't see any additional benefit over just adding an attribute. Please correct me if I'm wrong.
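For concreteness, my reading of the subgraph interpretation is something like the following sketch of the upstream tf_device dialect (the device string and wrapped op are just examples):

```mlir
// tf_device.launch wraps a device subgraph in a region: ops inside the
// region run on the device named by the "device" attribute, and
// tf_device.return yields the results back to the enclosing host graph.
%0 = "tf_device.launch"() ({
  %1 = "tf.Abs"(%arg0) : (tensor<?xi32>) -> tensor<?xi32>
  "tf_device.return"(%1) : (tensor<?xi32>) -> ()
}) {device = "/device:GPU:0"} : () -> tensor<?xi32>
```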