Modeling LLVM IR Struct Types

Rationale

Currently, the LLVM dialect type in MLIR wraps instances of llvm::Type * and delegates printing, parsing and uniquing to LLVM itself. While this approach has benefits (in particular, an exact correspondence between the type systems, identical syntax, and trivial translation logic), it has two major drawbacks.

  1. The type system support is not complete. More specifically, it is currently impossible to round-trip identified structs, because llvm::parseType expects to be called in a context where all identified structs have already been defined in a module.
  2. This approach requires the LLVM dialect to own an LLVMContext (and an llvm::Module), which, unlike MLIRContext, is not thread-safe. This in turn requires additional synchronization when working with the LLVM dialect, which has led to multiple issues when attempting to achieve parallel compilation (e.g. D78207, D78580).

An alternative is to replicate the LLVM IR type system within MLIR, providing first-class modeling for all LLVM types as MLIR dialect types. However, such modeling is challenging, if at all possible, with the current MLIR approach.

Note: I do not (yet) propose to change the representation of LLVM dialect types, but rather seek ways to work around the expressiveness limitations.

Working Example

LLVM IR supports identified structs in order to enable (mutually) recursive types. For example, one can specify a type that contains a pointer to another, not-yet-defined type.

%a = type { %b* }
%b = type { %a* }

This is currently impossible in MLIR: a type must be fully specified when constructed.

Furthermore, MLIR types are uniqued in the context, whereas identified structs in LLVM IR are said not to be uniqued. If an identified struct with the same identifier is created programmatically, or parsed in a different module but in the same LLVMContext, it is renamed automatically.
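
For illustration, this renaming behavior looks roughly as follows when driven from the LLVM C++ API (the exact suffix chosen for the second struct is an implementation detail):

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"

void renamingExample() {
  llvm::LLVMContext ctx;
  llvm::StructType *first = llvm::StructType::create(ctx, "a");
  llvm::StructType *second = llvm::StructType::create(ctx, "a");
  // `first` keeps the name "a"; `second` is a distinct type that LLVM
  // silently renames (to "a.0" or similar) instead of uniquing it to `first`.
  (void)first;
  (void)second;
}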

Representing such types in MLIR requires addressing two issues:

  1. Type uniquing;
  2. Type parsing.

Possible Representations

LLVM supports identified structs by decoupling type creation from initialization: it is possible to create an identified struct, use it in the construction of other types, and only then initialize it by setting up its body.
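
Concretely, the %a/%b example above is typically built through the LLVM C++ API as follows (a rough sketch, with error handling omitted):

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"

void buildMutuallyRecursiveStructs() {
  llvm::LLVMContext ctx;
  // Create both identified structs first, without bodies.
  llvm::StructType *a = llvm::StructType::create(ctx, "a");
  llvm::StructType *b = llvm::StructType::create(ctx, "b");
  // Initialize the bodies afterwards, once both types exist.
  a->setBody(llvm::PointerType::getUnqual(b)); // %a = type { %b* }
  b->setBody(llvm::PointerType::getUnqual(a)); // %b = type { %a* }
}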

Practically, the behavior of LLVM IR with respect to identified structs can be seen as ensuring a unique “name + module” pair: in a given module, only one identified struct with a given name can exist. This is relatively straightforward to implement in MLIR by extending the type uniquing infrastructure into a key/value store, where the value is mutable and can hold the body of an identified struct, set after the struct itself has been constructed and uniqued. Calling get with a given name and module would then return the same type instance, whose body can be filled in later. It is equally possible to auto-rename upon construction if the name is already in use.
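
As a purely hypothetical sketch of such key/value storage (the class name and the trySetBody helper are invented for illustration and are not existing MLIR API; a real implementation would also need to key on the “name + module” pair and make the mutation thread-safe):

#include "mlir/IR/TypeSupport.h"
#include "mlir/IR/Types.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"

// Hypothetical storage keyed only by the struct name; the body is mutable
// and is filled in after the type has been created and uniqued.
struct IdentifiedStructTypeStorage : public mlir::TypeStorage {
  using KeyTy = llvm::StringRef;

  explicit IdentifiedStructTypeStorage(llvm::StringRef name) : name(name) {}

  bool operator==(const KeyTy &key) const { return key == name; }

  static IdentifiedStructTypeStorage *
  construct(mlir::TypeStorageAllocator &allocator, const KeyTy &key) {
    return new (allocator.allocate<IdentifiedStructTypeStorage>())
        IdentifiedStructTypeStorage(allocator.copyInto(key));
  }

  // The mutable value part: the body may be set exactly once.
  bool trySetBody(llvm::ArrayRef<mlir::Type> elements) {
    if (initialized)
      return false;
    body.assign(elements.begin(), elements.end());
    initialized = true;
    return true;
  }

  llvm::StringRef name;
  llvm::SmallVector<mlir::Type, 4> body;
  bool initialized = false;
};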

Module-based Uniquing

The challenging part of this scheme is referring to the module, and parsing. If we follow LLVM, identified structs with the same name but different bodies can appear in different modules. Furthermore, we need to prevent uses of identified structs defined in another module. To achieve this, when parsing a type, we would need to identify which module the type belongs to. This sounds challenging in MLIR, since there is no way to guarantee that a type is only used within a single module. Moreover, MLIR types are immortal in the context, whereas modules are not, although one could arguably make the modules immortal as well.

It is also unclear how to refer to the “name + module” pair in the syntax. Giving modules a name only delays the problem of duplicate names (imagine parsing the same named module twice in the same MLIRContext). In the custom syntax, we could follow LLVM by requiring identified structs to be listed upfront in a hypothetical llvm.module operation (as attributes) and implementing local renaming, i.e., after parsing a struct identifier, looking up a uniqued version in a table and using it to get the type instead of the original name. This wouldn’t work if the llvm.module operation is printed in the generic syntax, even with the dialect registered, because we cannot inject custom behavior into the generic parser.

Alternatively, MLIR provides type aliases, but they are declared in the top-level module and used by the parser only. Providing scoped aliasing capability, where an alias is only active within a top-level module or within a given region, could be a potential solution.

Delayed Resolution

An alternative representation would only use and unique the identifier of the struct. The mapping between identifiers and bodies could be stored in a hypothetical llvm.module operation, e.g., as an array-of-types attribute. Functions accessing or modifying the body of such a type would take the module operation as an argument and modify it, while the types themselves would remain immutable. Parsing would be trivial in this case, as only the name would need to be parsed. However, depending on the module, an identified struct "a" could have different bodies. The downside of this approach is that, from the type alone, it is impossible to tell whether two identified structs used in different modules have the same or different bodies, since the type only contains the name.
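
As a rough, hypothetical sketch, assuming the bodies are kept in a dictionary attribute named "llvm.struct_bodies" on the module-like operation (both the attribute name and the helper below are invented for illustration), the body lookup could be a free function over the module operation:

#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"

// Fetch the body of identified struct `name` from the enclosing module-like
// operation; returns false if no body has been recorded for that name.
bool getStructBody(mlir::Operation *module, llvm::StringRef name,
                   llvm::SmallVectorImpl<mlir::Type> &body) {
  auto bodies =
      module->getAttrOfType<mlir::DictionaryAttr>("llvm.struct_bodies");
  if (!bodies)
    return false;
  auto elements = bodies.getAs<mlir::ArrayAttr>(name);
  if (!elements)
    return false;
  for (mlir::Attribute element : elements)
    body.push_back(element.cast<mlir::TypeAttr>().getValue());
  return true;
}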

Conclusion

It appears that it is currently impossible to achieve exactly the same behavior as LLVM has for identified structs (i.e., unique names with auto-renaming; structs with identical names can be parsed repeatedly without error if they are in different modules; types with the same name in different modules correspond to different Type instances) without extending MLIR.

I am vaguely inclined to work towards having region-scoped aliases + forward-referencing for types in the alias list and implement structs based on that, but I would like to gather opinions and alternative proposals before proceeding.

This would resemble the following in the textual IR:

module {
  llvm.module {
    !a = !llvm.struct<"a", ptr<!b>>
    !b = !llvm.struct<"b", ptr<!a>>
  }
  // cannot use !a or !b outside of scope
  llvm.module {
    !a = !llvm.struct<"a", i32> // this is fine
  }
}

Ping @River707 @clattner

Are there further thoughts on this topic? Or a resolution?

I’ve run into this exact problem lately and would rather not resort to a delayed-resolution approach, which works in my particular case (since there are globally unique type IDs). The only problem that I don’t think is resolved by this particular hack is printing/parsing: I don’t see a way to guarantee that both types get printed.

There were no reactions to this thread, as you can see…

River and I chatted about this offline and decided that, for the LLVM dialect, it makes sense to go for a minimal viable solution (likely based on fully printing the struct type every time, except for self-references, and postponing the decision on multi-module type behavior to better times). Do you have a use case other than the LLVM dialect?

I’m confused: would you resort to it or would you not?

RE use case: Yes, my primary use case is not LLVM. I’m trying to model a strongly typed messaging system (for both software and hardware). Said messages are flat memory (C-style structs), so I only run into issues when (buffer) pointers are present. Said pointers do not necessarily point to the type in which they are members; they could reference that type indirectly. Additionally, void* is not an option, since the amount of space/wires to be allocated for the pointed-to data is based on the size of the data type being pointed to. The aspect that makes the delayed-resolution approach work without namespaces is that each message has a globally unique, randomly generated id associated with it.

If anyone is curious, I’m shooting for a superset of the Capnp type system.

RE resort to it: if there was some way to ensure that all the needed type data was present in the various serializations, yes I think I would (assuming I don’t run into other issues).

Hello,

I discovered this thread while searching for a way to represent and compile (with MLIR) something looking like the following code. Assume it’s a single file, and that I want to compile it without including any other file, meaning that struct mystruct1 remains opaque:

// Extern definitions
struct mystruct1 ;
int externc(struct mystruct1*) ;


// Local definitions for which I want to generate code
struct mystruct2 {
  struct mystruct1* s1 ;
  int i ;
} ;

struct mystruct2 myfun(struct mystruct1* i) {
  struct mystruct2 ms2 ;
  ms2.s1 = i ;
  ms2.i  = externc(i) ;
  return ms2 ;
}

My current assumption is that MLIR can’t currently handle this because there are no standard opaque types. Is this correct?

If yes, then to compile such an example, I would first have to transform all pointers to opaque types into pointers to some other type (e.g. void*, which in MLIR would be memref, except that memref does not work, so it can be memref) and then let the linker do its magic. Of course, the problem with this is that I cannot later place the definition of the function externc in the same module as myfun, because the pointer types are incompatible…

Is this correct?

Please, let me know what you think.
BTW, this type of code would be useful in the compilation of dataflow dialects. This is why I am interested in it.

Best,
Dumitru

It isn’t clear to me what the limitation is right now. It seems that you have your own named “struct” type; unless you need recursive types, you should be able to model this in your own dialect.

AFAICT, there are two potential problems in general:

  • If you’re using string identifiers, you potentially run into namespace issues (depending on the language semantics).
  • In the case where it’s important to have the body of mystruct1 available, both struct types would have to appear in the MLIR output. The workaround of which I am aware is – IMO – pretty hack-y.

I think this is possible today, though I would personally like to see a better solution to both of these issues.

Up to now, for me, one of the main advantages of MLIR has been that data handling was almost completely done using existing MLIR types and lowering code; I was able to focus exclusively on the operational semantics of the various levels. If I write my own types, everything becomes more complicated and less elegant. And there’s also the issue of the std dialect, which is currently my exclusive target and which I’d have to bypass, at least partially. That is elegant neither semantically nor engineering-wise. But I think that (after the interesting answers I got in the past few days to several questions, not only this one) I have some ideas on how to work around this.

The MLIR type system is not limited to standard types. If they don’t suit your needs, and there are no other types that do, consider proposing and implementing your own. In this specific example, the LLVM dialect does support opaque named structs now that my RFC has been implemented. It can be used as a blueprint for implementing other opaque and delayed-initialization types.
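
For reference, a rough sketch of what this looks like with the implemented API in a recent MLIR (exact signatures may differ between versions; the pointer element is elided to stay version-agnostic):

#include "mlir/Dialect/LLVMIR/LLVMTypes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/MLIRContext.h"

void buildStructs(mlir::MLIRContext &ctx) {
  // Opaque identified struct with an unknown body, akin to the
  // forward-declared `struct mystruct1;` in the C snippet above.
  auto opaque = mlir::LLVM::LLVMStructType::getOpaque("mystruct1", &ctx);
  // Identified struct created first; the body is attached afterwards.
  auto s2 = mlir::LLVM::LLVMStructType::getIdentified(&ctx, "mystruct2");
  auto i32 = mlir::IntegerType::get(&ctx, 32);
  // setBody may fail, e.g. if a different body has already been set.
  (void)s2.setBody({/* pointer-to-mystruct1 elided */ i32}, /*isPacked=*/false);
  (void)opaque;
}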

Memref is not a pointer and was not really intended as such. There is no standard pointer type. I would be interested in a proposal for one.

This heavily depends on the use cases, so I took the “non-opinionated” option common to the rest of the MLIR infra. One use case may want to automatically merge opaque and non-opaque types with the same name (e.g., a linker). Another may want to automatically rename conflicting names (e.g., sequential processing of multiple unrelated modules in one context). A third may want to just complain (e.g., a linter). The MLIR infra should support all of them, but I am not aware of any use case struggling with this right now, so thinking about infrastructural support for a specific case or cases sounds premature.

I will be happy to see a better proposal motivated by a use case.

Why? MLIR allows one to mix types and operations from different dialects, so I don’t really see the problem. Standard types are not really special except for shorter syntax and some legacy APIs.

Again, MLIR allows for and encourages mixing different dialects. There is an ongoing discussion in a neighboring thread on whether and how to split the standard dialect into pieces. Feel free to chime in.

Sorry… I didn’t mean to insult your approach. It was the best solution given the lack of a clear way to support this today, and it didn’t involve making substantial changes to MLIR. Also, I agree on this point: a comprehensive proposal should be motivated by several compelling use cases. Until someone either has a specific use case or has time to write about one thoroughly, there’s nothing to discuss productively.