[RFC] Structured data for extensibility in LLVM IR

Hi all,

In D150370 “Introduce StructuredData” and related patches (see the Stack tab), I am proposing a form of “structured data” in LLVM IR. Following review discussions there, I’m publishing this RFC to raise broader awareness and call for feedback.

Overview

In a nutshell, the idea of “structured data” is to have a generic JSON/YAML/etc.-style object/data notation and representation in LLVM IR, geared towards the specific needs of LLVM IR. The potential use cases are broad, and I will provide more on that below. For now, imagine being able to write metadata such as:

!interval{ lo: i48 5, hi: i48 1234567890 }

The part in braces is “structured data”, and the example already shows one way in which it is geared towards LLVM IR: integers have a bit width and are represented as APInts.

Another example taken from the use cases below is defining properties of target types using the following syntax:

  type target("foo") {
    layout: type i64,
    canBeGlobal: i1 true,
  }

Again, the part in braces is “structured data”, and as you can see it allows referring to llvm::Types in a straightforward manner.

Structured data is primarily meant to be used as a generic representation and notation when printing and parsing IR assembly and reading and writing bitcode. It is not meant to be used as an active compile-time representation:

In the type info example above, the structured data is parsed from an .ll file using a generic parser, but is then stored (via a type info deserializer) in a purpose-specific C++ class where e.g. the layout field is represented directly as an llvm::Type *.
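To make the split between the generic form and the purpose-specific class concrete, here is a minimal standalone sketch of what such a deserializer could look like. It does not use the API from the patches: `Type`, `SizedInt`, `Field`, `StructuredData`, `TargetTypeInfo` and `deserializeTargetTypeInfo` are all hypothetical stand-ins, and rejecting unknown keys is an assumption rather than a decided policy.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for llvm::Type.
struct Type {
  std::string name;
};

// Hypothetical stand-in for an APInt: a value with an explicit bit width.
struct SizedInt {
  unsigned bits;
  std::uint64_t value;
};

// Generic parsed form: an ordered list of key/value fields.
using FieldValue = std::variant<const Type *, SizedInt>;
struct Field {
  std::string key;
  FieldValue value;
};
using StructuredData = std::vector<Field>;

// Purpose-specific class that the compiler works with after parsing.
struct TargetTypeInfo {
  const Type *layout = nullptr;
  bool canBeGlobal = false;
};

// Converts the generic form into TargetTypeInfo. Returns std::nullopt on an
// unknown key or on a field whose value has the wrong kind or bit width.
std::optional<TargetTypeInfo>
deserializeTargetTypeInfo(const StructuredData &data) {
  TargetTypeInfo info;
  for (const Field &f : data) {
    if (f.key == "layout") {
      const auto *ty = std::get_if<const Type *>(&f.value);
      if (!ty || !*ty)
        return std::nullopt;
      info.layout = *ty;
    } else if (f.key == "canBeGlobal") {
      const auto *flag = std::get_if<SizedInt>(&f.value);
      if (!flag || flag->bits != 1)
        return std::nullopt;
      info.canBeGlobal = flag->value != 0;
    } else {
      return std::nullopt; // assumption: unknown keys are an error
    }
  }
  return info;
}
```

After this step, passes only touch `TargetTypeInfo`; the generic field list is needed again only when printing or writing bitcode.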

There are arguably already areas in LLVM that “want” to have this capability, most notably metadata. Debug info metadata uses a syntax sort of like the proposed structured data if you squint enough, except that it is all rather ad hoc and implemented intrusively with a lot of boilerplate code. The intention of the proposal is that use cases with needs similar to those of debug info will prefer to use structured data in the future.

The set of features in structured data that I ultimately envision are:

  • Key-value maps as shown above
  • Integers and floats as APInt and APFloat, respectively
  • llvm::Type and llvm::Metadata references
  • Strings, enabling a human-readable representation of enums
  • Heterogeneous arrays (not implemented)
  • Arbitrarily nested structures(?)

There is an argument to be made that structured data may also want to allow llvm::Value references, though we can always use ValueAsMetadata.

The implementation under review on Phabricator is an MVP geared towards the initial use case, target type info.

The use cases I have in mind so far are, from short term to long term:

  1. Target type info
  2. Extended metadata
  3. Extended instructions

Please let me know what you think, give feedback on the details, and so on. The rest of this post contains details on the use cases.

Use case: Target type info

See a separate thread for the full details. In a nutshell: target() types have certain associated properties that we need to be able to represent in LLVM IR.

Instead of coming up with yet another ad hoc syntax extension and writing yet another ad hoc printer/parser pair for this, I want to use structured data.

This particular use case doesn’t benefit much from structured data (other than having a systematic syntax extension). But it serves as a convenient testing vehicle for the idea.

Use case: Extended metadata

This is the first truly load-bearing use case I have in mind. I posted a very early draft on Phabricator:

The key idea is to have a new type of metadata which is defined by a class name and a structured data payload. Quoting from the draft:

Benefits of extended metadata

Extended metadata is intended to have three main advantages over MDTuples for building complex metadata structures:

  • IR assembly can contain meaningful label names, making it easier to read for humans.

  • Extended metadata is represented using C++ objects that can have meaningfully named accessors (no more MD.getOperand(MAGIC_NUMBER))

  • Extended metadata is represented using plain data types, so that the in-memory representation is smaller and generally more efficient (no more pointer chasing from MDNode → ValueAsMetadata → ConstantInt)

Debug info metadata is already built to have these advantages, but the approach used there does not scale well:

  • Lots of boilerplate code in the definitions of the various DI* metadata classes

  • Intrusive boilerplate code in LLParser/AsmPrinter/MetadataLoader/BitcodeWriter

  • Completely ad-hoc .ll syntax gets in the way of tooling

  • Not usable by downstream users of LLVM for many reasons (intrusive code in parser/printer/loader/writer, MetadataKind is a closed enum that is switch()ed over in many places)

Extended metadata as presented here fixes all of these issues except for the first one. For the first issue, I am considering a TableGen-based solution along the lines of llvm-dialects and MLIR ODS, but that is not strictly needed to make extended metadata work, and we should also consider a CRTP-based solution at some point.

Overview of extended metadata

In a nutshell, the solution of extended metadata has the following pieces:

  • ExtMetadata is an abstract base class representing extended metadata. Extended metadata objects are defined by their class name and a structured data object as payload.

  • ExtMetadata classes can be registered with LLVMContexts at context creation time (the set of classes is frozen the first time an extended metadata object is created). Registering a class means:

    • Defining a C++ subclass of ExtMetadata that can be serialized and deserialized to structured data

    • Receiving a numeric class ID that is used to hook the C++ subclass into LLVM’s custom RTTI system (isa<>, cast<>, etc.)

    • Being able to hook custom verification into the IR verifier

  • IR may contain extended metadata whose class (name) has not been registered with the LLVMContext. Such metadata is preserved as a black box using the GenericExtMetadata class. This situation can happen when IR is written out from (an intermediate stage of) a compiler built on LLVM that registers its own extended metadata classes, and this IR is then fed into generic tools like opt, llvm-reduce, llvm-dis, etc.
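The pieces above can be sketched in standalone C++ roughly as follows. This is only an illustration under simplifying assumptions, not the design from the draft: `Context` stands in for LLVMContext, `Payload` is a simplified stand-in for structured data, and `registerClass`, `create`, `FlagMetadata` and `registerFlagClass` are hypothetical names. The numeric class ID plays the role that the ID would play for isa<>/cast<>.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Simplified stand-in for a structured data payload.
using Payload = std::map<std::string, long>;

class ExtMetadata {
public:
  virtual ~ExtMetadata() = default;
  unsigned getClassID() const { return classID; }

protected:
  explicit ExtMetadata(unsigned id) : classID(id) {}

private:
  unsigned classID;
};

constexpr unsigned GenericClassID = 0;

// Black-box fallback for class names nobody registered.
class GenericExtMetadata : public ExtMetadata {
public:
  GenericExtMetadata(std::string name, Payload data)
      : ExtMetadata(GenericClassID), className(std::move(name)),
        payload(std::move(data)) {}
  std::string className;
  Payload payload;
};

// Stand-in for LLVMContext: owns the class registry.
class Context {
public:
  using Factory =
      std::function<std::unique_ptr<ExtMetadata>(unsigned, const Payload &)>;

  // Returns the new class ID, or nullopt if the registry is frozen or the
  // name is already taken.
  std::optional<unsigned> registerClass(const std::string &name, Factory f) {
    if (frozen || classes.count(name))
      return std::nullopt;
    unsigned id = nextID++;
    classes.emplace(name, Entry{id, std::move(f)});
    return id;
  }

  // Creating any object freezes the set of classes.
  std::unique_ptr<ExtMetadata> create(const std::string &name,
                                      const Payload &payload) {
    frozen = true;
    auto it = classes.find(name);
    if (it == classes.end())
      return std::make_unique<GenericExtMetadata>(name, payload);
    return it->second.factory(it->second.id, payload);
  }

private:
  struct Entry {
    unsigned id;
    Factory factory;
  };
  std::map<std::string, Entry> classes;
  unsigned nextID = 1;
  bool frozen = false;
};

// Hypothetical registered class with a single boolean field.
class FlagMetadata : public ExtMetadata {
public:
  FlagMetadata(unsigned id, bool value) : ExtMetadata(id), enabled(value) {}
  bool isEnabled() const { return enabled; }

private:
  bool enabled;
};

std::optional<unsigned> registerFlagClass(Context &ctx) {
  return ctx.registerClass(
      "demo.flag",
      [](unsigned id, const Payload &p) -> std::unique_ptr<ExtMetadata> {
        auto it = p.find("enabled");
        return std::make_unique<FlagMetadata>(id, it != p.end() &&
                                                      it->second != 0);
      });
}
```

A generic tool such as opt would never call anything like `registerFlagClass`, so every `demo.flag` object it reads ends up in the generic fallback and is written back out unchanged.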

Use cases of extended metadata

I believe there would be some benefit to porting existing metadata uses in LLVM to this new infrastructure, e.g. AA metadata, for the reasons listed above (e.g. compile-time improvements). That said, my primary motivation for doing this work is in downstream compilers.

In our graphics shader compiler use case (LLPC), there is a lot of metadata specific to graphics APIs. One such example is rasterizer state, a collection of settings (typically bools or small integers) that tweak aspects of rasterization which are conceptually fixed-function but nevertheless affect the shader compilation process in some way.

In the status quo, we are faced with an awkward choice:

  • either we represent this state entirely outside of IR, which breaks common workflows like lit testing because the state is missing from .ll files

  • or we represent it using MDTuples in some way, but the tuples are far from human-readable and compile-time access to them is slow.

With extended metadata, we would be able to represent rasterizer state in human-readable form in an .ll file along the lines of:

!lgc.rasterizer.state = !{!0}
    
!0 = !lgc.rasterizer.state {
  discardEnable: i1 true,
  perSampleShading: i1 false,
  rasterStream: i2 0,
  ...
}

And at compile time, this structure is represented by an lgc::RasterizerStateMetadata class that is derived from llvm::ExtMetadata and contains all these fields as plain C++ bools or integers, which results in code that is both easier to read and faster.
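For illustration, a standalone sketch of the plain-data side might look like the following. It is a hypothetical stand-in for lgc::RasterizerStateMetadata that covers only the three fields shown in the example (the elided ones are left out), does not derive from any real base class, and invents the accessor names and the `print` helper. Masking `rasterStream` to two bits is likewise an assumption about how an i2 field would be stored.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical plain-data mirror of the three fields in the IR example.
class RasterizerState {
public:
  RasterizerState(bool discard, bool perSample, std::uint8_t stream)
      : discardEnable(discard), perSampleShading(perSample),
        rasterStream(stream & 0x3) {} // i2: keep only the low two bits

  // Named accessors instead of operand indices.
  bool isDiscardEnabled() const { return discardEnable; }
  bool isPerSampleShading() const { return perSampleShading; }
  unsigned getRasterStream() const { return rasterStream; }

  // Renders the payload in the notation used in the IR example.
  std::string print() const {
    return "{ discardEnable: " + printBool(discardEnable) +
           ", perSampleShading: " + printBool(perSampleShading) +
           ", rasterStream: i2 " + std::to_string(getRasterStream()) + " }";
  }

private:
  static std::string printBool(bool b) {
    return std::string("i1 ") + (b ? "true" : "false");
  }

  bool discardEnable;
  bool perSampleShading;
  std::uint8_t rasterStream;
};
```

A pass would then write `state.isDiscardEnabled()` and read a plain bool, rather than walking from an MDNode operand to a ConstantInt.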

Use case: Extended instructions

At a high level, this follows an approach similar to that of extended metadata, to gain similar benefits in the definition of custom instructions:

  • Better readability than call instructions
  • Extensible by downstream users (unlike intrinsics)
  • More efficient compile-time representation of “immargs” (consider: the volatile bit of load is represented far more efficiently than the volatile bit of the memcpy intrinsic; extended instructions unlock the efficiency everywhere)

And the approach follows the broad strokes of extended metadata:

  • ExtInst abstract base class in the Instruction hierarchy conceptually holds all the information required by extended instructions: instruction name / opcode, operands, result type, attributes (in the call instruction sense, for side-effect modeling), structured data for additional constants

  • ExtInst subclasses / opcodes can be registered with LLVMContexts

  • GenericExtInst class generically represents extended instructions for which a subclass hasn’t been registered

Unlike for extended metadata, the benefits of extended instructions don’t lean entirely on structured data, but structured data is a large part of the value proposition.