[RFC] Implicit propagation of dialect attributes (best effort)

Hi all,

I’ve been trying to address an issue we have with the TensorFlow dialect, but that may apply to other domains as well: the implicit propagation of dialect attributes.

In TensorFlow, every operation can have an optional device attribute. It isn’t an attribute defined by the operations themselves and it isn’t a property of each operation in particular. Such attributes aren’t exposed through ODS-generated accessors and aren’t managed directly in DRR. As such, any transformation as simple as the canonicalization of tf.Add into tf.AddV2 will drop the device attribute.

This is particularly limiting as right now we can’t always convert a TensorFlow graph into MLIR, canonicalize it, and convert it back to a TensorFlow graph that can be executed: preserving some attributes is unfortunately important for some test cases.

I wrote a simple DialectInterface for attribute propagation, which allows a dialect to control the propagation of its dialect attributes.
The hook signature is simply void propagateAttribute(NamedAttribute attr, Operation *op); it is invoked after creating a new operation op with a NamedAttribute for this dialect (we look up based on the key, so for TensorFlow the attribute would look like tf.device = "...").
To interact seamlessly with any kind of pattern rewrite, we augment the OpBuilder with an optional DictionaryAttr. Every time an operation is created with the builder, the dictionary is iterated over and each matching attribute is dispatched to the interface along with the created operation.
This dictionary can be set manually on the builder by a transformation, but it is automatically set to the attribute dictionary of an operation when the OpBuilder is constructed from an existing Operation. The pattern application driver will also set this dictionary from the root operation matched by the framework. As such, every existing pattern will by default propagate the attributes from the root operation (the implementation of a pattern can override this by resetting the dictionary on the provided rewriter).
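To make the mechanism concrete, here is a simplified, standalone sketch of the idea (this is not the code from D95394; Operation, OpBuilder, and the hook names are toy stand-ins for the real MLIR types): a builder carries a dictionary of propagated attributes and, after each op creation, dispatches every dialect-prefixed attribute to the hook registered by the owning dialect.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Toy stand-ins for MLIR's Operation / NamedAttribute; real types differ.
using AttrDict = std::map<std::string, std::string>;
struct Operation {
  std::string name;
  AttrDict attrs;
};

// Hypothetical per-dialect hook: decide whether/how to copy an attribute
// onto a newly created op.
using PropagateHook =
    std::function<void(const std::string &key, const std::string &value,
                       Operation &op)>;

struct OpBuilder {
  AttrDict propagatedAttrs;  // set from the root op by the rewrite driver
  std::map<std::string, PropagateHook> dialectHooks;  // keyed by prefix, e.g. "tf"

  Operation create(const std::string &name) {
    Operation op{name, {}};
    // Post-creation: dispatch each propagated attribute to the dialect
    // that owns its prefix ("tf.device" -> dialect "tf").
    for (const auto &[key, value] : propagatedAttrs) {
      auto dot = key.find('.');
      if (dot == std::string::npos)
        continue;  // not a dialect attribute, never propagated
      auto it = dialectHooks.find(key.substr(0, dot));
      if (it != dialectHooks.end())
        it->second(key, value, op);
    }
    return op;
  }
};
```

A "tf" hook that unconditionally copies the attribute would then reproduce the behavior described above: a pattern rewriting tf.Add into tf.AddV2 keeps tf.device without the pattern knowing about it.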

Feel free to check the revision I linked above (D95394) with an example in the TestDialect.

This would be useful functionality to have. But two questions / concerns:

  1. With this approach, wouldn’t all created ops get the attributes of an op that was used for the insert point? This looks arbitrary outside of all pattern rewriter use cases! (where there is no op replacement but say just IR injection happening). It looks like you’ll need a new OpBuilder ctor or to explicitly use setPropagatedAttributes on the builder when needed?

  2. The added overhead on the creation of every op is a concern. Even if there are no dialect name prefixed attributes, you are cycling through all of the “insert point op’s” attributes (op’s intrinsic attributes as well as any of its user-defined attributes) in all cases where the OpBuilder was constructed using the ctors that take an Operation *. It is also weird that there would be a workaround to avoid this overhead by creating the OpBuilder via its “insert point” ctor (propagated_attrs would be empty in that case) or some of the other ctors?

We could remove the convenience of initializing the propagatedAttrs on OpBuilder construction and let the user do it if it feels like it could lead to mistakes.

In the TF dialect today every op effectively has a device attribute in the NodeDef. It is not represented in the dialect as an actual attribute today and is lumped in with all other unregistered attributes. I feel like there is a large difference between an attribute that is part of the op (but is currently modelled differently) and a general propagation mechanism for attributes. I am concerned about just propagating “metadata” with unclear semantics freely. This feels like a mechanism to solve a symptom.

Note: this is true for _XlaCompile & _output_shapes (there is also an XLA shape one), and a host of others too. And the current behavior in TF is (to a first approximation) to propagate all of these except if it is on an identified list (e.g., most are always propagated, some are never propagated). Which also means that the shape inference pass needs to be aware of _output_shapes attribute (and the corresponding XLA one) unless it wants to create inconsistent/invalid IR.

I’d rather not reinforce that and have passes blindly propagate these until we get to a pass that finally has to reason about them and so could fail; the failure would then occur long after the point where the attribute should have been reasoned about or not incorrectly added.

Or is the proposal here that all of those attributes will become explicit, registered attributes of the operations? (it feels weird to say there are attributes that are so important that they have to be retained, but not important enough to capture, but also this is not a closed set that can be captured).

This is a special case of having a hook over all ops post creation (with potentially one or multiple source ops [as here, a “source” for attributes]). Why not just post-process? Or allow such general post-create hooks? (This could even be post OperationState creation, before Operation construction.)

It feels to me that the propagateAttribute hook and how it is positioned cater to cases where there can be no invalid interactions of these propagated attributes (or every pattern now has to explicitly reject matches) and where these attributes have no effect on optimizations. I wonder about cases where two attributes are not compatible on the same op: the pattern writer would now need to locally check that the builder with which it was invoked has attribute X, as the pattern unconditionally adds attribute Y to an op, but then you’d end up with an op carrying both X and Y, which aren’t allowed together. This seems to move patterns to have deny lists locally (unless there are never incompatible attributes, for which I think there needs to be a requirement placed on these propagated attributes from the start).

device is not a convincing case for me here: for device, I’d only want to do some optimizations or rewrites if ops are on the same device. So not dropping it is one thing, but saying patterns don’t have to reason about it seems wrong (and yes, I could of course now add an additional predicate that queries the device for my pattern, but it is now easier to ignore device placement, so it sets the default to not caring about device placement). In the previously proposed structural modelling of devices/launches, we’d naturally not incur additional device-to-device communication since the structure would prevent it; here it becomes implicit/opaque, and auto-propagation makes it easy to ignore IMHO.

_XlaCompile similarly requires consideration: otherwise you end up rewriting an op that can be compiled into equivalent ops that can’t be, while retaining that they must be compiled. So you end up creating a graph that can no longer execute or compile.

This is still under the control of the dialect behind the interface: nothing is really made with “unclear semantics” as these are dialect defined attributes.

We’re getting a bit too much into the specifics of TensorFlow here: the TensorFlow dialect can ultimately manage its own list of attributes with its own policy. I’m just not sure there is a principled solution to match the existing TensorFlow implementation, which is, as you mentioned, not very principled in the first place.

The proposal here is not about operation-dependent attributes but about dialect attributes exclusively, i.e. “metadata” with little guarantee.

Yes, this is correct; we could make it a general “post creation” hook, but that raises more questions about which dialect is called post-creation. Here you have a clear dispatch point, which is the named attribute, so this hook is really coupled to this notion.
(I considered the OperationState, but I figured post-creation is richer as you have information that is added by the builder code itself, like result type inference for example.)

Yes; that said, this is also why the hook is invoked post op creation: the dialect implementation can enforce any validity check for the propagation (including calling an op interface for this dialect attribute propagation).

I would encode this in the dialect hook implementation.

Again: this is to be addressed inside the hook implementation with a policy there.

I think the ability to propagate certain attributes would be super useful, at least for the TF dialect. But I’m a little skeptical about the approach here, in particular, letting a central dialect hook control the attribute propagation behavior for all attributes. I think what we actually want is two different types of attributes:

  1. “Normal” attributes which belong to an op and should not be propagated through IR transformations.
  2. “Persistent” attributes that need to be propagated through transformations until they are consumed at some point (e.g., device attributes or certain XLA attributes). They don’t belong to an op and therefore shouldn’t be explicit attributes for any ops.

Can’t we do the following:

  • Model “persistency” as a property/trait of an attribute (instead of having a central hook in the dialect that decides about what to propagate and what not).
  • Add persistent attributes to a dialect as needed. E.g., the “device” attribute would be added as a persistent attribute to the TF dialect (but it wouldn’t be an explicit attribute of any TF op).
  • When manipulating IR, only propagate attributes that are persistent (as long as we stay in the dialect).

This would have a clearer separation between the two different types of attributes (see 1. and 2.) and should be more efficient because we wouldn’t need runtime checks (as in the dialect hook) for whether an attribute should be propagated, we could simply deduce it from the attribute type traits.
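The trait-based alternative can be pictured with a standalone sketch (purely illustrative; MLIR attributes do not currently support such traits, and all names below are hypothetical): the propagation policy is registered once per persistent attribute, so the builder can filter with a simple lookup instead of consulting an open-ended runtime hook.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical propagation policy, fixed per attribute rather than
// decided by central dialect logic at each op creation.
enum class Propagation { Never, WithinDialect };

struct Dialect {
  std::string prefix;  // e.g. "tf"
  // The dialect declares its "persistent" attributes and their policies
  // once, e.g. {"tf.device", Propagation::WithinDialect}.
  std::map<std::string, Propagation> attrPolicies;

  bool shouldPropagate(const std::string &key,
                       const std::string &newOpDialect) const {
    auto it = attrPolicies.find(key);
    if (it == attrPolicies.end())
      return false;  // a "normal" attribute: never propagated
    // Persistent attributes propagate only while staying in the dialect.
    return it->second == Propagation::WithinDialect &&
           newOpDialect == prefix;
  }
};
```

Under this sketch, whether tf.device survives a rewrite is deducible from its declared policy alone, which is the "clearer separation" argued for above.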

I would encode this in the dialect hook implementation.

This is still not clear to me: how would general DAG-to-DAG patterns be handled by default if attributes for the input ops are not consistent?

This is codified in LangRef actually: MLIR Language Reference - MLIR

This is an option, but only attribute types can have traits/properties. What you’re proposing would imply that tf.device can’t be a StringAttr; the dialect would have to create a specific attribute type just to be able to attach a trait. Another issue is that you would be defining this on the attribute type, while an instance of this attribute can be used as “dependent” on an op. I guess we could enforce that an attribute with such a trait can only be a dialect attribute, but at that point what have you achieved that makes it better / more convenient than letting the dialect control based on the key?

Finally, on top of just enabling the “persistent” aspect, the hook I implemented also lets the dialect specify more conditions for the propagation (for example that it propagates only to operations in the current dialect, or only to operations of a given type, etc.).
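As an isolated illustration of such a conditional policy (a toy predicate, not the actual D95394 code; the names are hypothetical), a dialect hook could gate propagation on both the attribute prefix and the target op’s dialect:

```cpp
#include <cassert>
#include <string>

// Illustrative policy: propagate a "tf."-prefixed attribute only onto ops
// that themselves belong to the "tf" dialect, so the attribute is dropped
// as soon as a rewrite leaves the dialect.
bool shouldPropagate(const std::string &attrKey, const std::string &opName) {
  const std::string prefix = "tf.";
  return attrKey.rfind(prefix, 0) == 0 &&  // starts_with, pre-C++20 style
         opName.rfind(prefix, 0) == 0;
}
```

The point being made is that this kind of arbitrary predicate (on the op name, its type, its operands, …) is easy to express inside a hook but awkward to encode as a fixed trait.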

Are you talking about DRR (TableGen pattern) or general DAG-to-DAG patterns in C++?
For the former, there is no good story (but I’m not sure how a trait on the attribute type would help either?), for the latter, it defaults to the root operation, but it can be overridden in the code with any custom logic.

This is codified in LangRef actually: MLIR Language Reference - MLIR

I think we are mixing different things here (persistent attributes != dialect attributes): The language reference talks about dependent and dialect attributes but not about attribute propagation, which is orthogonal. I don’t see why we shouldn’t have some dialect attributes that are persistent/propagating and others that are not (we probably wouldn’t have propagating dependent attributes). Maybe it was misleading that I talked about attribute types; let’s rather call it a propagation type.

This is an option, but only attribute types can have trait/property. What you’re proposing would imply that tf.device can’t be a StringAttr, the Dialect would have to create a specific attribute type just to be able to attach a trait.

I see, what about adding a template parameter for the propagation type to Attribute (default = non-propagating) and inheriting classes (like StringAttr)? I guess that would be a big change so not sure if that’s an option…

Another issue is that you would be defining this on the attribute type, while an instance of this attribute can be used as “dependent” on an op.

I think it makes sense to define this property on the attribute type. Why would you want different propagation behavior based on where this attribute type is used? I would find such behavior opaque.

but at this point what have you achieved that makes it better / more convenient than letting the dialect control on the key?

The main achievement would be a simpler mental model and easier usage: An attribute can be either persistent/propagating or not, and you can see it directly at the type. And also, this would be a compile time property so it would be more efficient. The dialect hook approach is very flexible but I don’t know if the additional flexibility is needed here. My concern is that it’s harder to reason about the propagation behavior of attributes if there is centralized logic in a dialect hook, compared to having the behavior encoded in the type (via propagation type).

for the latter, it defaults to the root operation, but it can be overridden in the code with any custom logic.

I think this is a dangerous default because inconsistent attributes might be merged together in a non-obvious way, resulting in loss of information. For attributes like “device” (and others that effectively represent a partitioning of ops) we would not want to apply patterns when attribute values are inconsistent (e.g., we can’t apply a 2-to-1 pattern for ops where one device is TPU and the other one is CPU, this corresponds to the behavior after building TF device clusters). So should the default rather be to not match in such cases? Or maybe the behavior could be encoded in the propagation type?
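The conservative default argued for here can be sketched in isolation (a hypothetical helper, not existing MLIR API): for a partitioning attribute like device, a multi-op match only yields a propagatable value when all source ops agree, and otherwise signals that the pattern should not fire.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical merge policy for a partitioning attribute such as "device":
// returns the common value when all source ops agree, and nothing when the
// values are inconsistent (e.g. TPU vs CPU), so a 2-to-1 pattern can bail.
std::optional<std::string>
mergeDevice(const std::vector<std::string> &sourceDevices) {
  if (sourceDevices.empty())
    return std::nullopt;
  for (const auto &d : sourceDevices)
    if (d != sourceDevices.front())
      return std::nullopt;  // inconsistent: refuse to merge
  return sourceDevices.front();
}
```

Encoding this refusal as the default (or as part of the propagation type) is exactly the alternative to silently picking the root op’s value.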

It seems to me that a lot of things are “integer”, “string”, or “boolean” and you still may want to propagate them or not. I believe that in general what actually makes them propagated is not a property of the type, but rather a property of what they describe (like tf.device). This is why I’m trying to anchor there: this is where the semantics of the attribute really are.

I don’t really want to get too much into the details of what is suitable for TensorFlow and device though. What I understand here about device is that we shouldn’t use any implicit propagation for device and it should be manipulated explicitly. It is a possibility, but then it becomes slightly orthogonal to this interface.

It seems to me that a lot of things are “integer”, “string”, or “boolean” and you still may want to propagate them or not.

Right, hence the suggestion to basically parametrize attribute types w.r.t. propagation types. But I agree, it would be nicer to store the propagation type per specific attribute (like tf.device) and not per attribute type, which is what I suggested before, but you mentioned earlier that we don’t have traits/properties for attributes, only for types.

I believe that in general what actually makes them propagated is not a property of the type, but rather a property of what they describe (like tf.device). This is why I’m trying to anchor there: this is where the semantics of the attribute really are.

I see your point. My thinking was that propagation behavior is such an essential property of an attribute that it should be modeled directly at the attribute and not in centralized dialect logic. Basically similar to op traits, just for attributes. But there doesn’t seem to be an easy way to do this. I agree that the attribute type might not be the best place to model it, but isn’t it possible to model it per specific dialect attribute (see above)? So basically each dialect attribute would have a specified AttributeType and, in addition, a propagation type or trait.

In general, I’m +1 on building out an interface for better supporting dialect(-prefixed) attributes. Support for these right now is fairly ad hoc, inconsistent, and with no way to understand any context behind them. For example, FuncOp implicitly tries to (and here I mean the patterns that use it, not necessarily FuncOp itself) propagate all dialect attributes for the simple reason that it has no idea how to understand when they should be propagated. The problem of attribute propagation, and merging, in MLIR is much more difficult than in (likely) most other compiler infrastructures. We need some way for generic transformations to be able to reason about these things, at least if we want to be conservatively correct/optimal.

The above is a statement of support in general for such an interface, now for a bit of discussion on the currently proposed hooks. (I separated these two because I feel the exact semantics of the interface can be discussed separately from the fact that I think it should exist.) The current interface is structured (or so some of the comments in the revision lead me to believe) very tightly around the pattern rewriter. I would prefer we structure this interface in such a way that it applies more generally, i.e. most of the places where I would be interested in using this interface would not be in the rewriter but in generic transformations (Inliner, CSE, etc.). Looking at it this way, framing the current hook as purely “propagation” seems limited. As others have mentioned, how do you imagine that we support merging attributes between operations? More specifically, how could we support something like the “mergeMetadata” hook in MLIR?

Along with the above, but could we instead create a more proper dialect(-prefixed) attribute interface that could contain the current verification hooks that are on Dialect?

As for the implicit propagation of these attributes, I have some slight reservations along the same lines as @jpienaar: implicitly propagating these attributes in all situations is not necessarily desirable. I don’t have any specific suggestions at the moment, but I think this is a real problem that we need to address.

– River

I’m not sure what explicitly stating a propagation type would get you? How does this improve the situation? A dialect can already verify the types of the dialect(-prefixed) attributes with the verifier hooks in Dialect.

– River

I’m not sure what explicitly stating a propagation type would get you? How does this improve the situation? A dialect can already verify the types of the dialect(-prefixed) attributes with the verifier hooks in Dialect.

The main advantages I see are a simpler mental model, clearer semantics, and easier usage: If you could just add a trait to each (dialect) attribute which clearly specifies the desired propagation behavior for this attribute (like you add traits for ops, e.g. IsCommutative), then this behavior would be tightly coupled with each attribute. In contrast, with the dialect hook you basically have a central case distinction in the hook that decides how every attribute is propagated, which is less transparent to me (very flexible, but with unclear semantics). I’m not saying it doesn’t work, but I would prefer a solution where a clearly defined propagation type (e.g., ‘never propagate’ or ‘always propagate inside dialect’, I’m simplifying here) is part of the attribute in some way. As an analogy (not perfect but similar enough), for ops we also want to store properties directly at the ops (as traits) rather than having some central logic decide which ops have what kind of properties.

Now I’m not that familiar with the related code so I don’t know if such dialect attribute traits (or similar) are feasible.

Personally, this seems like a solution with a particular narrow use case in mind. It seems to me like this is actually quite a broad problem, and it might often make sense to transform or merge attributes in this process as well. I would feel better about this as a solution if it solved a family of related problems, rather than just a single use case.

We brainstormed a bit about this in a group, and here are some thoughts:

  • Implicit propagation of attributes can be unsafe. The dialect which defines an attribute isn’t always in a good position to decide if it can be propagated. For example, information about aliasing can be invalidated by a pass, but the dialect callback wouldn’t be able to know if it is safe to propagate. A loop transformation can turn a parallel loop into a sequential one by introducing reuse across iterations, but if an attribute carrying the result of a previous analysis on the loop is propagated over, it would be incorrect.
    It seems that the number of situations where an attribute can be propagated this way is fairly limited.
    The current mechanism works in contexts like TensorFlow where you would perform graph transformations and stay within the same dialect, but it wouldn’t scale much further beyond this.
    LLVM metadata can always be dropped, but more importantly it is explicitly propagated only when a pass understands the metadata. The important part is that the transformation semantics are taken into account in the decision of whether to propagate. This is critically missing from the current proposal, where the dialect hook can only decide to propagate an attribute based on the newly created op, without more context.

  • Another aspect is that of merging attributes, which this proposal does not address at all.
    When matching a few ops and rewriting them, it is too limiting to just take one of the ops as the reference for the propagation: we need to decide how to merge the attributes defined on these multiple ops and how to propagate them to the set of newly created ops.

  • Finally, there are things like function parameter attributes, where the dictionary of attributes is nested and the currently proposed interface can’t handle it implicitly.

We propose to continue the exchange on this topic on Thursday during the ODM meeting. In the meantime, feel free to fuel the discussion here.


@mehdi_amini it seems like this proposal never materialized into actual changes to MLIR. Would you remember what kind of workaround you guys ended up implementing for TensorFlow?

Unfortunately I don’t remember right now. Maybe @jpienaar does?