[proposal] Extensible IR metadata

Devang's work on debug info prompted this, thoughts welcome:
http://nondot.org/sabre/LLVMNotes/ExtensibleMetadata.txt

-Chris

So this particular metadata would be an extension of the type? And
get propagated through as you create instructions that depend on other
instructions which had the type metadata attached?

I found myself wishing for such a thing several times recently. I
ended up using a "type tag" of type [0 x opaque*] in my structs to
force the type system to differentiate them from each other and make
unique classes out of them. It works, but I need a separate hashtable
to get back to a "class description" from a Value of a given Type.

No, this would be a property of the operation. In a dynamically typed language like Python (and many others), a naive translation will turn all Python objects into "void*"s. However, with some static or dynamic analysis, many types can be guessed at or inferred. This is a property of the various operations, not of "void*".

-Chris

I've got a suggestion for a refinement:

Right now, we have classes like DILocation that wrap an MDNode* to
provide more convenient access to it. It would be more convenient for
users if instead of

  MDNode *DbgInfo = Inst->getMD(MDKind::DbgTag);
  Inst2->setMD(MDKind::DbgTag, DbgInfo);

they could write:

  DILocation DbgInfo = Inst->getMD<DILocation>();
  Inst2->setMD(DbgInfo);

We'd use TheContext->RegisterMDKind<MyKindWrapper>() or
...Register<MyKindWrapper>("name") to register a new kind. (I prefer
the first.)

These kind wrappers need a couple of methods to make them work:

const StringRef KindWrapper::name("...");
KindWrapper(MDNode*); // Except for special cases like LdStAlign.
KindWrapper::operator bool() {return mdnode!=NULL;} // ??
int StaticTypeId<KindWrapper>::value; // Used for the proposal's MDKind
bool KindWrapper::ValidOnValue(const Value*);
MDNode* KindWrapper::merge(MDNode*, MDNode*); // For the optimizers

StaticTypeId is a new class that maps each of its template arguments
to a small, unique integer, which may be different in different
executions.

Since the optimizers may want more methods over time, but we don't
really want to require users to extend their wrappers, we should say
that all wrappers must inherit from a particular type. I'd name this
type "MDKind" and rename the proposed MDKind to MDKindID. Then we can
add defaults to MDKind over time. Nothing needs to be virtual since
these types are all used as template arguments.
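A minimal sketch of the typed accessors and the common base type, using stand-in classes rather than the real LLVM ones. `MDNode`, `Inst`, `MDKindBase` (the base the text would call "MDKind"), and `LineTag` are hypothetical names for illustration, and the name-keyed table is an assumption, not the proposal's storage scheme:

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for llvm::MDNode; carries a payload just for the demo.
struct MDNode { int payload; };

// Common base for all kind wrappers. Defaults can be added here over time
// without touching user wrappers. Nothing is virtual.
struct MDKindBase {
  explicit MDKindBase(MDNode *N = nullptr) : Node(N) {}
  explicit operator bool() const { return Node != nullptr; }
  MDNode *getNode() const { return Node; }
  // Default merge policy (an assumption): keep metadata only if both agree.
  static MDNode *merge(MDNode *A, MDNode *B) { return A == B ? A : nullptr; }
private:
  MDNode *Node;
};

// A hypothetical user-defined kind wrapper.
struct LineTag : MDKindBase {
  using MDKindBase::MDKindBase;
  static const char *name() { return "linetag"; }
  int line() const { return getNode()->payload; }
};

// Stand-in for an instruction with a per-kind metadata table.
class Inst {
  std::map<std::string, MDNode *> MD;
public:
  template <typename KindT> KindT getMD() const {
    auto It = MD.find(KindT::name());
    return KindT(It == MD.end() ? nullptr : It->second);
  }
  template <typename KindT> void setMD(KindT K) {
    MD[KindT::name()] = K.getNode();
  }
};
```

With this shape, copying metadata between instructions becomes `Inst2.setMD(Inst1.getMD<LineTag>())`, and the kind is deduced from the wrapper type instead of being passed as a separate enum.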

We could either use a global list of IDs for the MDKinds or have
separate lists for each Context. StaticTypeId can only provide a
global list, so giving each Context its own list would take an extra
lookup, and wouldn't provide any benefit I can see.

Chris mentioned that .bc files would store the mapping from name->ID,
so the fact that StaticTypeId changes its values between runs isn't a
problem.
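To illustrate why run-to-run ID changes are harmless when the file carries names, here is a rough sketch of remapping file-local IDs on load. `KindRegistry` and `buildRemap` are hypothetical, and the assumption that the file stores its names in ID order is mine, not something the proposal specifies:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Per-process registry: hands out small integer IDs in registration order,
// so the numbers may differ from one execution to the next.
class KindRegistry {
  std::unordered_map<std::string, unsigned> NameToID;
public:
  unsigned getOrRegister(const std::string &Name) {
    auto It = NameToID.find(Name);
    if (It != NameToID.end())
      return It->second;
    unsigned ID = static_cast<unsigned>(NameToID.size());
    NameToID.emplace(Name, ID);
    return ID;
  }
};

// FileNames[i] is the kind name that the file refers to as ID i.
// The result translates file-local IDs into this process's IDs.
std::vector<unsigned> buildRemap(const std::vector<std::string> &FileNames,
                                 KindRegistry &Reg) {
  std::vector<unsigned> Remap;
  for (const std::string &Name : FileNames)
    Remap.push_back(Reg.getOrRegister(Name));
  return Remap;
}
```

A reader would then look up `Remap[FileID]` for every metadata reference it decodes; kinds unknown to the current process simply get fresh IDs.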

Thoughts?

The document mentions "instructions" a lot. We'll want to be able to
apply metadata to ConstantExprs as well at least, if not also Arguments
(think noalias) and other stuff, so it seems best to just talk about
"values" instead, and DenseMap<Value *, ...> instead of
DenseMap<Instruction *, ...>.

Dan
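A small sketch of what a side table keyed on values rather than instructions could look like. The class hierarchy here is a stand-in, `std::unordered_map` substitutes for `DenseMap`, and `MetadataTable` is a hypothetical name:

```cpp
#include <cassert>
#include <unordered_map>

// Stand-ins for the LLVM hierarchy; only the inheritance shape matters here.
struct Value { virtual ~Value() = default; };
struct Instruction : Value {};
struct Argument : Value {};
struct MDNode { int payload; };

class MetadataTable {
  // value -> (kind ID -> node)
  std::unordered_map<const Value *, std::unordered_map<unsigned, MDNode *>> Table;
public:
  void set(const Value *V, unsigned Kind, MDNode *N) { Table[V][Kind] = N; }

  MDNode *get(const Value *V, unsigned Kind) const {
    auto VI = Table.find(V);
    if (VI == Table.end())
      return nullptr;
    auto KI = VI->second.find(Kind);
    return KI == VI->second.end() ? nullptr : KI->second;
  }

  // Must be called when a value is destroyed so no stale entry remains.
  void erase(const Value *V) { Table.erase(V); }
};
```

Keying on `Value *` lets an `Argument` carry something like a noalias tag through the same interface an `Instruction` uses.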

> I've got a suggestion for a refinement:
>
> Right now, we have classes like DILocation that wrap an MDNode* to
> provide more convenient access to it. It would be more convenient for
> users if instead of
>
>   MDNode *DbgInfo = Inst->getMD(MDKind::DbgTag);
>   Inst2->setMD(MDKind::DbgTag, DbgInfo);
>
> they could write:
>
>   DILocation DbgInfo = Inst->getMD<DILocation>();
>   Inst2->setMD(DbgInfo);
>
> We'd use TheContext->RegisterMDKind<MyKindWrapper>() or
> ...Register<MyKindWrapper>("name") to register a new kind. (I prefer
> the first.)

Yes, this is very convenient. This, along with the rest of Chris' proposal, is
very similar to the way we handled metadata in a compiler I worked on years
ago. It was so useful we even used it to stash dataflow information away as
we did analysis. Of course we had metadata tagged on control structures as
well. I'd like to see the current proposal extended to other constructs, as
Chris notes.

> StaticTypeId is a new class that maps each of its template arguments
> to a small, unique integer, which may be different in different
> executions.

How does this work across compilation units? How about with shared LLVM
libraries? These kinds of global unique IDs are notoriously difficult
to get right. I'd suggest using a third-party unique-id library. Boost.UUID
is one possibility but not the only one.

I have a few questions and comments about Chris' initial proposal as well.

- I don't like the separation between "built-in" metadata and "extended"
  metadata. Why not make all metadata use the RegisterMDKind interface and
  just have the LLVM libraries do it automatically for the "built-in" stuff?
  Having a separate namespace of enums is going to get confusing. Practically
  every day I curse the fact that "int" is different than "MyInt" in C++. :-/

- Defaulting alignment to 1 when metadata is not present is going to be a
  huge performance hit on many architectures. I hope we can find a better
  solution. I'm not sure what it is yet because we have to maintain safety.
  I just fear a Pass inadvertently dropping metadata and really screwing
  things up.

This looks very promising!

                            -Dave

I wrote: "Note that this document talks about metadata for instructions, it might make sense to generalize this to being metadata for all non-uniqued values (global variables, functions, basic blocks, arguments), but I'm just keeping it simple for now."

However, constant exprs are uniqued. What would you find it useful for?

-Chris

> I wrote: "Note that this document talks about metadata for instructions, it might make sense to generalize this to being metadata for all non-uniqued values (global variables, functions, basic blocks, arguments), but I'm just keeping it simple for now."

I missed that part.

> However, constant exprs are uniqued. What would you find it useful for?

We have inbounds on ConstantExprs today, for example.

Dan

... and it was an interesting source of problems. Do you think that inbounds on constantexprs is really a good idea? It means that we can get into a world where we have: "gep p, 0, 1" and "gep inbounds p, 0, 1" not be uniqued.

The impact of this is somewhat reduced by libanalysis and vmcore trying to infer inbounds etc. Instead of putting inbounds on the constantexpr, why not make that "inference" be a predicate that any client could ask of the constantexpr?

-Chris
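As a rough illustration of "inbounds as a predicate" rather than a flag stored on the uniqued constant, consider this sketch. `ConstGEP` is a hypothetical, heavily simplified model (a constant GEP into a global array of known length), not the real ConstantExpr class, and the rule used is only the simple array case:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a constant GEP into a global array.
struct ConstGEP {
  uint64_t ArrayLen;             // element count of the underlying array
  std::vector<int64_t> Indices;  // {pointer index, element index}
};

// Answers "is this GEP provably in bounds?" on demand, so the constant
// itself needs no inbounds bit and "gep p, 0, 1" stays a single object.
bool isKnownInBounds(const ConstGEP &G) {
  if (G.Indices.size() != 2)
    return false;               // shape not handled: conservatively unknown
  if (G.Indices[0] != 0)
    return false;               // stepping off the global itself
  int64_t Elt = G.Indices[1];
  // One past the end is still a valid in-bounds address.
  return Elt >= 0 && static_cast<uint64_t>(Elt) <= G.ArrayLen;
}
```

A client that wants the property asks `isKnownInBounds(G)`; one that does not care never sees two differently-flagged copies of the same expression.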

template<typename T>
class StaticTypeId {
public:
  static int id;
};
extern int NextStaticTypeId; // Initialized to 0. Possibly an atomic type instead.
template<typename T> int StaticTypeId<T>::id = NextStaticTypeId++;

This relies on the compiler uniquing static member variables across
translation units, and I've never tested that across shared library
boundaries. The initializer didn't work with gcc-2 (there was a
workaround), but I believe it works with gcc-4. I've never tested it
with MSVC. We can also use static local variables, which would have a
different set of bugs, but they're very slightly slower to access.

Since there's a registration step, we could also use Pass-style IDs,
and have the registration fill them in, which would avoid uniquing
problems.
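A sketch of the registration-filled alternative, under the assumption that kinds are keyed by name in one global table. `MDKindRegistrar`, `DbgKind`, and `TBAAKind` are hypothetical names; nothing here depends on static-initializer order or on the linker uniquing template statics:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Registration assigns the ID explicitly instead of relying on static
// initializers. Keying by name means two copies of the same wrapper
// (for example in two shared libraries) end up with the same ID.
class MDKindRegistrar {
  std::unordered_map<std::string, int> NameToID;
public:
  template <typename KindT> int registerMDKind() {
    auto It = NameToID.find(KindT::name());
    if (It == NameToID.end())
      It = NameToID.emplace(KindT::name(),
                            static_cast<int>(NameToID.size())).first;
    KindT::ID = It->second;  // fill in the wrapper's ID slot
    return KindT::ID;
  }
};

// Hypothetical wrappers; ID stays -1 until the kind is registered.
struct DbgKind {
  static int ID;
  static const char *name() { return "dbg"; }
};
struct TBAAKind {
  static int ID;
  static const char *name() { return "tbaa"; }
};
int DbgKind::ID = -1;
int TBAAKind::ID = -1;
```

Registering the same kind twice is idempotent here, which is the property the static-member trick cannot guarantee across library boundaries.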

> I have a few questions and comments about Chris' initial proposal as well.
>
> - I don't like the separation between "built-in" metadata and "extended"
>   metadata. Why not make all metadata use the RegisterMDKind interface and
>   just have the LLVM libraries do it automatically for the "built-in" stuff?
>   Having a separate namespace of enums is going to get confusing. Practically
>   every day I curse the fact that "int" is different than "MyInt" in C++. :-/

"builtin" metadata would also be registered, the only magic would be that the encoding would be smaller in the IR.

> - Defaulting alignment to 1 when metadata is not present is going to be a
>   huge performance hit on many architectures. I hope we can find a better
>   solution. I'm not sure what it is yet because we have to maintain safety.
>   I just fear a Pass inadvertently dropping metadata and really screwing
>   things up.

I don't expect metadata to be commonly stripped. This could be just as bad a perf problem for other things like TBAA or high level type information for a dynamic language. I think it is important that the IR is possible to reason about even in uncommon cases though.

-Chris
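The safety argument can be made concrete with a tiny sketch: a query that falls back to the most conservative answer when the metadata is absent. `Inst`, `AlignTable`, `getAlignment`, and `stripMetadata` are hypothetical stand-ins, not the proposed API:

```cpp
#include <cassert>
#include <unordered_map>

struct Inst {};
using AlignTable = std::unordered_map<const Inst *, unsigned>;

// Most conservative alignment: always correct, possibly slow.
constexpr unsigned DefaultAlign = 1;

// A missing entry never produces a wrong answer, only a pessimistic one.
unsigned getAlignment(const AlignTable &T, const Inst *I) {
  auto It = T.find(I);
  return It == T.end() ? DefaultAlign : It->second;
}

// Models a pass that drops metadata it does not understand.
void stripMetadata(AlignTable &T, const Inst *I) { T.erase(I); }
```

Dropping the entry degrades the generated code but keeps the IR meaningful, which is the trade-off under discussion.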

> - I don't like the separation between "built-in" metadata and "extended"
>   metadata. Why not make all metadata use the RegisterMDKind interface and
>   just have the LLVM libraries do it automatically for the "built-in" stuff?
>   Having a separate namespace of enums is going to get confusing. Practically
>   every day I curse the fact that "int" is different than "MyInt" in C++. :-/

> "builtin" metadata would also be registered, the only magic would be
> that the encoding would be smaller in the IR.

Except the API is different. Built-in types use a well-known enum
value not available to extended metadata. I have no problem with a
smaller IR encoding. It's the programming interface I'm concerned
about. I'd rather it be the same for everything.

> I don't expect metadata to be commonly stripped. This could be just
> as bad a perf problem for other things like TBAA or high level type
> information for a dynamic language. I think it is important that the
> IR is possible to reason about even in uncommon cases though.

Sure. Just something we need to be aware of.

                               -Dave

> This relies on the compiler uniquing static member variables across
> translation units, and I've never tested that across shared library
> boundaries. The initializer didn't work with gcc-2 (there was a
> workaround), but I believe it works with gcc-4. I've never tested it
> with MSVC. We can also use static local variables, which would have a
> different set of bugs, but they're very slightly slower to access.

Shared libraries are the big problem. I know the Boost guys had endless
discussions about how to design a Singleton to work in the presence of shared
libraries and this is pretty close to the same problem.

> Since there's a registration step, we could also use Pass-style IDs,
> and have the registration fill them in, which would avoid uniquing
> problems.

Yes, I think that should work. Doing things with static initializer magic is
asking for trouble.

                                 -Dave

Dan Gohman wrote:

The pushback has been about adding lots of weird and special purpose extensions, not the encoding.

-Chris

Chris Lattner wrote:

Dan Gohman wrote:

Devang's work on debug info prompted this, thoughts welcome:
http://nondot.org/sabre/LLVMNotes/ExtensibleMetadata.txt

The document mentions "instructions" a lot. We'll want to be able to
apply metadata to ConstantExprs as well at least, if not also Arguments
(think noalias) and other stuff, so it seems best to just talk about
"values" instead, and DenseMap<Value *, ...> instead of
DenseMap<Instruction *, ...>.

I'm wondering that too. Can we replace LLVM function attributes with metadata? There's been some pushback to adding new function attributes in the past and it would be nice to be able to prototype new ones without having to change all of the vm core.

The pushback has been about adding lots of weird and special purpose extensions, not the encoding.

The bar is higher for getting something into the vm core, as it should be. It sounds like we're planning to permit special purpose metadata which is why I asked.

If nothing else, it would be more convenient to prototype new extensions to find out what they're really worth.

Nick

I just wonder how stable that would become as time passes by.

It is true that enabling the addition of metadata as part of the
structure is good for specialized optimizations that are not
represented internally (or relevant) in the LLVM core, but there is a
practical limit on what you can do.

My (humble) opinion is that basic language structure, such as function
attributes, should still be part of the core. And if there aren't
always easy ways of getting them and passing them through, we should
make it easier... in the core.

Metadata is a completely different beast. It's good for things that
only your own optimization pass or machine code will understand. It's
additional rather than required information, the lack of which
would be completely harmless.

I completely agree with the text argument that the demand (and
necessity) for metadata is increasing, but that doesn't mean we should
transform everything into it.

The RDF [1] developments sent a clear message that metadata per se are
too loose to hold value. We need a fixed, basic structure on where to
stick metadata, otherwise it'd just be a big slimy blob of untreatable
data.

My two cents...

cheers,
--renato

[1] RDF - Semantic Web Standards


Yep, I completely agree!

-Chris

IMO, there is no need to add two llvm::Instruction methods,
getMD and setMD. The metadata associated with an instruction will be
stored separately anyway.

Right now, I am preparing a very simple implementation that allows us
to make progress on the debug info front. At the same time, it'd be
possible for someone to extend it for other uses.