[RFC] Target type classes for extensibility of LLVM IR

Background

Building high-level compilers on top of LLVM sometimes requires the representation of high-level types that don’t fit neatly into LLVM’s type system. @jcranmer recently added “target extension types” which provide an initial extension point to LLVM’s type system.

Already today, high-level frontends can define their own “target” extension types (the name is misleading) as long as these types are lowered away somehow before the IR reaches the backend.

However, some uses of types require additional “type info” that is not captured by the TargetExtType itself, such as a data layout and whether zeroinitializer is allowed.

Proposal

I’m proposing to introduce the notion of “target (extension) type classes” that can be registered with an LLVMContext.

Target types are still identified by name (and type / int arguments), but may fall into a type class based on their name. If they do, the type class is queried for the additional “type info”. The currently hard-coded type info is replaced by setting up some built-in type classes as part of the LLVMContext constructor.
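To make the lookup concrete, here is a rough, self-contained model of it. This is not the code from D147697 and does not use LLVM's API: `TypeInfo`, `TypeClass`, `ClassRegistry`, and `vectorLikeInfo` are made-up names, and matching a class by exact type name is an assumption (the proposal only says that types fall into a class based on their name).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the "type info" discussed in the thread.
struct TypeInfo {
  std::string layout;  // layout type spelled as text, e.g. "i32"
  bool hasZeroInit;
  bool canBeGlobal;
};

// Hypothetical type class: computes type info from the integer parameters.
struct TypeClass {
  TypeInfo (*getInfo)(const std::vector<unsigned> &intParams);
};

// Hypothetical per-context registry, keyed by exact type name (an assumption).
class ClassRegistry {
  std::map<std::string, TypeClass> classes;

public:
  void registerClass(const std::string &name, TypeClass cls) {
    bool inserted = classes.emplace(name, cls).second;
    assert(inserted && "type class registered twice");
    (void)inserted;
  }

  // Fills `out` and returns true if a class is registered for `name`;
  // returns false for types that do not fall into any class.
  bool lookup(const std::string &name, const std::vector<unsigned> &intParams,
              TypeInfo &out) const {
    auto it = classes.find(name);
    if (it == classes.end())
      return false;
    out = it->second.getInfo(intParams);
    return true;
  }
};

// Example class whose layout depends on the first integer parameter.
TypeInfo vectorLikeInfo(const std::vector<unsigned> &intParams) {
  unsigned bits = intParams.empty() ? 32 : intParams[0];
  return TypeInfo{"i" + std::to_string(bits), /*hasZeroInit=*/true,
                  /*canBeGlobal=*/false};
}
```

The point of the function pointer is that the info can depend on the type's parameters, which is the case a flat table cannot express.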

This is a relatively small change at just over 100 lines in D147697, but it is enough to allow users of LLVM to define their own custom types and enjoy the full flexibility that is available to built-in target types.

In addition to the LLVM change, you can also see how we intend to use them in llvm-dialects in this pull request.

Alternatives

While the type info could theoretically be stored in the TargetExtType object itself, that route was not taken, for good reasons, when target types were added, and I’m not looking to change that.

In the current proposal, type classes are optional. One alternative would be to make them non-optional. That’s technically a breaking change, though almost certainly palatable at this early stage. I didn’t go down this route because it seemed unnecessary, but am curious to hear feedback.

Another alternative would be to try to encode the type info somehow in the DataLayout. However, that quickly becomes intractable: target types are parameterized, and how would we encode complex dependencies of e.g. the layout on those parameters?

Finally, I made the choice to cache the type class pointer in TargetExtType for faster access, avoiding repeated string comparisons for a small compile-time benefit. While this makes the type object slightly larger, and I did consider an ID-based scheme that would save some space, types are uniqued per context and their memory footprint is correspondingly small (and it only affects target types anyway).
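The caching trade-off can be illustrated with a small model, again using made-up names (`Context`, `ExtType`, `TypeClass`) rather than LLVM's classes. The type object resolves its class once at construction, so later queries never repeat the name lookup; the cost is one extra pointer per type object.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical type class carrying a single property.
struct TypeClass {
  bool hasZeroInit;
};

// Hypothetical context that counts how often a name-based lookup happens.
class Context {
  std::map<std::string, TypeClass> classes;

public:
  unsigned nameLookups = 0;

  void registerClass(const std::string &name, TypeClass cls) {
    assert(!name.empty());
    classes[name] = cls;
  }

  const TypeClass *findClass(const std::string &name) {
    ++nameLookups;  // this is the string comparison work we want to avoid
    auto it = classes.find(name);
    return it == classes.end() ? nullptr : &it->second;
  }
};

// Hypothetical extension type that caches its class pointer.
class ExtType {
  std::string name;
  const TypeClass *cls;  // resolved once in the constructor

public:
  ExtType(Context &ctx, std::string n)
      : name(std::move(n)), cls(ctx.findClass(name)) {}

  // Property queries use the cached pointer; no name lookup is needed.
  bool hasZeroInit() const { return cls && cls->hasZeroInit; }
};
```

An ID-based scheme would store a small integer instead of the pointer and index into a table in the context.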

Future work

This current proposal brings us closer to a healthy representation of extension types in high-level compilers built on LLVM. It allows us to remove a bunch of ugly hacks in LLPC, for example.

That said, even with this proposal, extension types are still in a bit of an awkward place, since some of them really are genuinely opaque and don’t even have a known byte size, yet we would like them to appear in alloca & friends for function-local variables. This is becoming more urgent with the proposal to replace GEPs with ptradd.

We currently paper over the related issues, but I think longer term we’ll want explicit “structured/opaque alloca” and “structured/opaque getelementptr” instructions to move to an entirely sound representation of what the high-level languages require from us.

I’m generally on board with the idea of allowing target types to be externally configured, rather than hardcoded in LLVM.

Something that’s not really clear to me in your particular scheme is who/what is responsible for registering these “type classes” and how we ensure that the information is available when necessary.

It sounds to me that you currently envision the frontend registering these, but this will not work in LTO scenarios. While this may not be relevant for your specific use case, I would expect this to become a problem in the future. To make LTO work, all the necessary information needs to be encoded in the LTO inputs in some form.

My general expectation here has always been that we will extend this in a way that the “type class” information is encoded as part of the IR module, in the sense that this information will be part of the textual IR and bitcode representations.

In the proposal, the creator of the LLVMContext is responsible for registering all the needed (non-builtin) type classes before loading/building IR.

This is because once we’ve looked up the type class for the first target type, adding more type classes risks changing the answer. It’s vaguely analogous to opaque pointers in that way.
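One way to enforce that ordering is to freeze the registry on first use. This is purely an illustration: the thread does not say the patch does this, and `FrozenOnUseRegistry` is a made-up name.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical registry that rejects registrations once any lookup has happened.
class FrozenOnUseRegistry {
  std::set<std::string> classNames;
  mutable bool frozen = false;

public:
  void registerClass(const std::string &name) {
    assert(!name.empty());
    if (frozen)
      throw std::logic_error(
          "type classes must be registered before the first lookup");
    classNames.insert(name);
  }

  bool hasClass(const std::string &name) const {
    // After the first query, the answer for every name has to stay stable.
    frozen = true;
    return classNames.count(name) != 0;
  }
};
```

With this, a late registration fails loudly instead of silently changing the answer for types that were already looked up.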

My thinking about LTO is pragmatic. If LTO is to see these extended types, then it must include some transform in its pass pipeline that lowers them away. I.e., there must be awareness of the types as part of the LTO setup. Presumably, whichever mechanism would be used to make LTO aware of the types (plugins? or they become builtin?) could also ensure that the type classes are registered in the first place.

Admittedly, there are some potential developer use cases that aren’t covered in my proposal. For example, we can generate textual (or bitcode) IR files which pass the verifier in our compilation pipeline, but loading them into opt may fail verification because a zeroinitializer exists for a target type for which opt doesn’t have a type class.

When I discussed something along those lines with @mehdi_amini in the MLIR context he pointed out that this doesn’t tend to be an issue in practice because one tends to have separate tools anyway. Indeed, for our use case we do have our own opt-alike tool. Though MLIR doesn’t have a generic representation of types in the first place.

I would like that very much.

Depending on “front-ends” to set things up is severely limiting and I don’t really see the reason we should do that early on.

What limitations do you have in mind? Front-ends do generally have to be aware of the entire compilation pipeline, e.g. setting up the target machine info. I don’t think this is particularly burdensome: if a front-end integrates some middle-end library that uses extended types, then presumably the middle-end library would offer a helper function that does all the required registrations.

Still, being able to use agnostic tools like opt is a plausible goal, so let’s explore how encoding the type info in the IR would work.

I think an important stake in the ground is that extension types are only identified by their name and parameters. Stated differently, you can’t have two types with the same name and parameters but different type info.

Some related points:

  1. The textual IR representation remains target("name", params...). Type info is somewhere out-of-line at the top level of the .ll file.
  2. Getting extended types in C++ compiler code remains a matter of TargetExtType::get(context, "name", type_params, int_params).
  3. Since types are global to an LLVMContext, it is an error to load two modules that try to define the same extended type with different type info.
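Point 3 amounts to a per-context table that accepts a repeated definition only if it agrees with the first one. A minimal model, with made-up names (`TypeKey`, `TypeInfo`, `ContextTypeTable`) and type parameters spelled as strings for brevity:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical identity of an extension type: name plus parameters.
struct TypeKey {
  std::string name;
  std::vector<std::string> typeParams;  // parameter types, spelled as text here
  std::vector<unsigned> intParams;

  bool operator<(const TypeKey &o) const {
    return std::tie(name, typeParams, intParams) <
           std::tie(o.name, o.typeParams, o.intParams);
  }
};

// Hypothetical type info, using the fields from the thread's examples.
struct TypeInfo {
  std::string layout;
  bool hasZeroInit;
  bool canBeGlobal;

  bool operator==(const TypeInfo &o) const {
    return layout == o.layout && hasZeroInit == o.hasZeroInit &&
           canBeGlobal == o.canBeGlobal;
  }
};

// Hypothetical table shared by every module loaded into one context.
class ContextTypeTable {
  std::map<TypeKey, TypeInfo> infos;

public:
  // Returns false if the key is already defined with different info.
  bool define(const TypeKey &key, const TypeInfo &info) {
    assert(!key.name.empty());
    auto result = infos.emplace(key, info);
    return result.second || result.first->second == info;
  }
};
```

A module loader would call `define` for each type info record it reads and report an error when it returns false.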

Can we agree on that as a framework?

I think those constraints sound reasonable. The only part that is a bit unfortunate is…

…but given the technical constraints we have, this wouldn’t be easy to avoid, and is unlikely to be a problem in practice.

Since types are global to an LLVMContext, it is an error to load two modules that try to define the same extended type with different type info.
…but given the technical constraints we have, this wouldn’t be easy to avoid, and is unlikely to be a problem in practice.

Pointer size mismatch is the one thing that gives me pause here, although the existing functionality is probably sufficient to handle any issues that might arise.

Getting extended types remains a matter of TargetExtType::get(context, "name", type_params, int_params).

While I agree with all of the constraints, I do want to highlight that I find this constraint to be the most non-negotiable: it needs to be possible for code manipulating IR to get target extension types only via the {context, name, parameter} tuple, without having to independently pull out the type info from somewhere else.

The broader dilemma I see is that target extension types represent something that’s akin to datalayout, but is too lengthy to be encoded in existing datalayout strings. One possibility that I can see is that we have IR/bitcode emission include a type information table for all of the target extension types actually used in the module, which means that downstream tools (e.g., something like llvm-reduce) can operate on the IR target-agnostically without having to figure out what to link in to get the relevant type information.

Yes, that makes sense to me. I’m going to draft up something in that direction. I’m on vacation for a bit over the next two weeks, so expect it to take a while.

I think keeping opt and other tools working in the presence of more involved dialects is important.
Today we can run opt -O3 on any module, or even load plugins with custom passes, without having to write an extra tool that sets up the types and then runs the opt pipeline internally.

To provide some context from MLIR: we tried really hard to ensure “system consistency” and reduce the amount of “surprising behavior”. We prefer failing to process something over silently discarding information or degrading the mode (some flags exist to opt in to disabling some safeguards and proceeding in “unsafe modes”).
For example, if the registration of some “target (extension) type classes” can impact the behavior of opt -O3, then we would rather fail to load a module than process it without the registrations. This protects, for example, “IR reproducers” (think of something happening in production that you’re trying to reproduce with opt, but can’t because of this kind of setup).

Brief update report on this.

I have an in-progress change that records the TargetTypeInfo of target types. It prints and parses IR assembly with type info encoded with syntax like this:

type target("b", 1) {
  layout: type i32,
  hasZeroInit: true,
  canBeGlobal: true,
}
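For illustration, a printer for this syntax could look roughly like the following. It is modeled on the example above only: `TypeInfoEntry` and `printEntry` are made-up names, type parameters are left out for brevity, and the real change may well print things differently.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one target type and its type info.
struct TypeInfoEntry {
  std::string name;
  std::vector<unsigned> intParams;
  std::string layout;  // layout type spelled as text, e.g. "i32"
  bool hasZeroInit;
  bool canBeGlobal;
};

// Prints one entry in the assembly syntax shown in the example above.
std::string printEntry(const TypeInfoEntry &e) {
  assert(!e.name.empty());
  std::ostringstream os;
  os << "type target(\"" << e.name << "\"";
  for (unsigned p : e.intParams)
    os << ", " << p;
  os << ") {\n";
  os << "  layout: type " << e.layout << ",\n";
  os << "  hasZeroInit: " << (e.hasZeroInit ? "true" : "false") << ",\n";
  os << "  canBeGlobal: " << (e.canBeGlobal ? "true" : "false") << ",\n";
  os << "}\n";
  return os.str();
}
```

Emitting one such entry per target type actually used in the module gives tools that know nothing about the frontend a self-describing table to work from.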

I have kept the notion of a TargetExtTypeClass as well, for the following reasons:

  • When a type is obtained from regular code via TargetExtType::get(Ctx, "name", Types, Ints), the class (if one is registered) is queried for the type info.
  • The type info from the registered class takes precedence, and loaded IR modules are checked against it as a form of validation.
  • The class has a Validate hook that allows validation of a (name, type_params, int_params) tuple.
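The interplay of these points can be sketched as a single check that runs when a module's type info is loaded. Everything here is a made-up model (`TypeClass`, `LoadResult`, `checkLoadedType`, `exactlyOneInt`), and the class reports one fixed `TypeInfo` for brevity, whereas a real class would presumably compute it from the parameters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type info, using the fields from the thread's examples.
struct TypeInfo {
  std::string layout;
  bool hasZeroInit;
  bool canBeGlobal;

  bool operator==(const TypeInfo &o) const {
    return layout == o.layout && hasZeroInit == o.hasZeroInit &&
           canBeGlobal == o.canBeGlobal;
  }
};

// Hypothetical registered class: the info it reports plus a validation hook.
struct TypeClass {
  TypeInfo info;
  bool (*validate)(const std::vector<unsigned> &intParams);
};

enum class LoadResult { Accepted, InvalidParams, InfoMismatch };

// Checks type info read from a module against the registered class, if any.
LoadResult checkLoadedType(const TypeClass *cls,
                           const std::vector<unsigned> &intParams,
                           const TypeInfo &fromModule) {
  if (!cls)
    return LoadResult::Accepted;  // no class registered: nothing to check against
  assert(cls->validate && "class must provide a validation hook");
  if (!cls->validate(intParams))
    return LoadResult::InvalidParams;
  if (!(cls->info == fromModule))
    return LoadResult::InfoMismatch;  // the registered class takes precedence
  return LoadResult::Accepted;
}

// Example hook: this class accepts exactly one integer parameter.
bool exactlyOneInt(const std::vector<unsigned> &intParams) {
  return intParams.size() == 1;
}
```

Without a registered class, the module's own info is all there is, which is the situation generic tools such as opt would be in.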

It’s still missing bitcode support, which I hope to be able to get around to over the next week or so. Please let me know if you have strong objections to this approach so I don’t waste my time :)

That makes sense to me. Clear warnings at least seem to be in order.

In the particular change I’m proposing, I think we’re okay. All information provided by the TargetExtTypeClass that can affect generic transforms is encoded in the IR module.

If a type class isn’t registered, then validation is missing, and getting a new type may not work correctly. But the validation is only for the types themselves, and generic transforms shouldn’t be attempting to obtain a new, unknown extension type anyway. They should only preserve the types that are already there.

Hi all,

I just uploaded the updated version of this change to Phabricator – see also the stack of related changes. You can also find it as a branch on GitHub. This is now ready to go in as far as I’m concerned.

The overall design follows what I laid out in my comment last week, which in turn followed the feedback from the initial discussions.

I have decided to mediate the IR printing/parsing and bitcode writing/reading changes via a “structured data” representation. The benefits are defining a systematically more regular, extensible syntax in IR assembly (which looks like the example I’ve posted before) and a certain amount of self-description in bitcode (which makes it easier to extend the type info while maintaining backwards compatibility). I do have an ulterior motive for this as well, which is to use the same mechanism for human-readable IR metadata roughly along the lines of DI metadata, but defined in a more principled manner.

Another month, another update on this. I’ve integrated all the review comments so far; the remaining changes to be reviewed (all linked together in a stack) are:

I’d appreciate another round of reviews!

I wanted to cross-post a question from D147697 ([IR] Add TargetExtTypeClass) here, as that’s my main open question for this functionality in terms of high-level design.

What the patch currently proposes is that the specified type properties are specific to a given combination of arguments. This means that something like

type target("mytype") {
  layout: type i8,
  hasZeroInit: i1 true,
}

will apply to target("mytype") only, but not to target("mytype", i8) or target("mytype", 123). Instead, the type properties would have to be repeated for each argument combination in use:

type target("mytype", 0) {
  layout: type i8,
  hasZeroInit: i1 true,
}
type target("mytype", 1) {
  layout: type i8,
  hasZeroInit: i1 true,
}
type target("mytype", 2) {
  layout: type i8,
  hasZeroInit: i1 true,
}
; ...

I would like some input from people using target types (especially @jcranmer) on whether this makes sense or not.

Generally, the three ways I can see this working are:

  1. Type properties are bound to the type name only, and are the same across all arguments.
  2. Type properties are bound to the specific type name + arguments combination (current proposal).
  3. There is some inheritance scheme, e.g. if you specify properties for type target("mytype"), they are inherited by type target("mytype", 1), but can still be overridden.
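To compare options 2 and 3, here is a small model of the lookup. It is only a sketch: `InheritingTable` is a made-up name, only integer arguments are modeled, and falling back to the argument-free entry is just one possible reading of “inheritance”.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type info, reduced to two properties.
struct TypeInfo {
  std::string layout;
  bool hasZeroInit;
};

using ArgList = std::vector<unsigned>;

// Hypothetical table of properties keyed by (name, arguments).
class InheritingTable {
  std::map<std::pair<std::string, ArgList>, TypeInfo> exact;

public:
  void define(const std::string &name, const ArgList &args,
              const TypeInfo &info) {
    assert(!name.empty());
    exact[std::make_pair(name, args)] = info;
  }

  // inherit == false models option 2: only an exact match counts.
  // inherit == true models option 3: fall back to the argument-free entry.
  const TypeInfo *lookup(const std::string &name, const ArgList &args,
                         bool inherit) const {
    auto it = exact.find(std::make_pair(name, args));
    if (it == exact.end() && inherit)
      it = exact.find(std::make_pair(name, ArgList()));
    return it == exact.end() ? nullptr : &it->second;
  }
};
```

Option 1 would correspond to keying the map by name alone, so that the arguments never take part in the lookup.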

So thanks to @jcranmer for weighing in on the review, but it feels like nobody has a really strong opinion here, so how do we decide?

Part of the problem is that we don’t know the future and how this is going to be used. I do lean towards making this more flexible (meaning: having the properties be per-type), because I think either model (between options 1 and 2) is equally easy to understand, and apart from that I see the potential upsides and downsides as:

  • If it turns out that per-type properties are useful in the future, then adding them after the fact is likely to be a bit painful (questions of upgrade path, backwards compatibility, etc.).
  • If it turns out that per-type properties never become useful, then the only cost is ~8 bytes extra per target type instance and potentially a bit of redundancy in .ll/.bc files.