LLVM IR Type System Rewrite

Several people have been proding me to write up my thoughts on how to fix the IR type system for LLVM 3.0. Here are some (fairly stream of conscious) thoughts on the matter:
http://nondot.org/sabre/LLVMNotes/TypeSystemRewrite.txt

Comments welcome!

-Chris

Hi Chris,

I don't see such a bad problem with PATypeHolder, but I get what you
say about the type refinement.

In the end, the logic is the same. You hold the type in a temporary
state, build it as you go and mark it as "complete" when you finish.
If you use type holders or the types themselves, it doesn't matter, as
long as there is a way to mark them as complete. In the end, you still
need to reject incomplete types on validation.

For IR readability, I welcome types being handled by name. This has
been a bit of a nuisance, but again, nothing terrible.

Since you're refactoring types, how about a quick detour to see if
it's possible to include unions and bitfields in the IR as well? I'm
not saying to write a full-blown spec, but would be good at least to
consider, so when someone finally have time to implement them, it's
not impossible... I think there are plenty of people in this list,
from different backgrounds and languages, that would welcome it.

I can send the first round of proposals, but you guys will have to
help me with the back-end implementation, as it's not my strong side.

cheers,
--renato

Hey Chris,

Several people have been proding me to write up my thoughts on how to fix the IR type system for LLVM 3.0. Here are some (fairly stream of conscious) thoughts on the matter:
http://nondot.org/sabre/LLVMNotes/TypeSystemRewrite.txt

Awesome! This is definitely long overdue...

Proposal seems very reasonable to me, I think the linker logic will
become much nicer overall by making things more explicit.

Strictly speaking, does this need to be a 3.0 feature? I haven't
thought about it, and don't know that part of the code particularly
well, but is the idea it would be too hard to auto-upgrade to this?

- Daniel

So struct types would unique by name. How important would it be for
the linker to preserve those names? Because I can think of a few
examples where it could be problematic. For instance, if you want to
link

  %Foo = type { i32 }
  %Bar = type { i32 }

  %myFoo = global %Foo
  declare void @useBar(%Bar* %arg)
  ; ...

with

  %Baz = type { i32 }

  %myFoo = global %Baz
  declare void @useBar(%Baz* %par)
  ; ...

then AFAICS the linker would either have to insert some casts (ugly)
or it would need to somehow recognize that %Baz, %Foo *and* %Bar need
to be merged into one type.

Inserting casts would mean that a way to cast between different struct
types is needed (for instance, bitcast could be extended to allow
casting between struct types). Which raises the question of what casts
to allow:
a) between struct types that are structurally identical (which leads
us back to graph isomorphism, though with only two types at a time to
worry about), or
b) between struct types with the same size (which might not be
knowable without target data), or
c) some other criterium I haven't thought of yet.

Merging pre-existing struct types (i.e. types in the source module)
would likely be problematic as well unless something like a "use list"
of types is still maintained. In other words, you still wouldn't be
able to get rid of PATypeHolder and friends.
Also, this wouldn't preserve the names of some of the struct types,
which you mentioned as a design goal.

Is there some other solution for this I'm not seeing, or is one of the
ones I mentioned less problematic than it appears at first glance?

No, I expect it to be quite compatible with existing .ll files (at least those generated by the asmwriter, it's not an accident that it does certain things certain wayts).

However, implementing it in a rush before the 2.9 branch seems unwise. It will also remove some IR features (like \1*) which seems unfriendly in a 2.x.

-Chris

Several people have been proding me to write up my thoughts on how to fix the IR type system for LLVM 3.0. Here are some (fairly stream of conscious) thoughts on the matter:
http://nondot.org/sabre/LLVMNotes/TypeSystemRewrite.txt

Hi Chris,

I don't see such a bad problem with PATypeHolder, but I get what you
say about the type refinement.

In the end, the logic is the same. You hold the type in a temporary
state, build it as you go and mark it as "complete" when you finish.
If you use type holders or the types themselves, it doesn't matter, as
long as there is a way to mark them as complete. In the end, you still
need to reject incomplete types on validation.

For IR readability, I welcome types being handled by name. This has
been a bit of a nuisance, but again, nothing terrible.

Right, I don't see either as a showstopper. The top things for me personally are getType() being slow, and type refinement being slow.

Since you're refactoring types, how about a quick detour to see if
it's possible to include unions and bitfields in the IR as well? I'm
not saying to write a full-blown spec, but would be good at least to
consider, so when someone finally have time to implement them, it's
not impossible... I think there are plenty of people in this list,
from different backgrounds and languages, that would welcome it.

These are quite orthogonal from the changes proposed. I think that unions are well understood (we just need someone who cares enough about non-C languages to make it happen). I don't think that bitfields should be in IR.

-Chris

Several people have been proding me to write up my thoughts on how to fix the IR type system for LLVM 3.0. Here are some (fairly stream of conscious) thoughts on the matter:
http://nondot.org/sabre/LLVMNotes/TypeSystemRewrite.txt

Comments welcome!

So struct types would unique by name. How important would it be for
the linker to preserve those names?

Not at all. The linker will do a best approximation, but it will autorename on unresolvable conflicts, just like we do for internal functions when linking.

Because I can think of a few
examples where it could be problematic. For instance, if you want to
link

%Foo = type { i32 }
%Bar = type { i32 }

%myFoo = global %Foo
declare void @useBar(%Bar* %arg)
; ...

with

%Baz = type { i32 }

%myFoo = global %Baz
declare void @useBar(%Baz* %par)
; ...

then AFAICS the linker would either have to insert some casts (ugly)
or it would need to somehow recognize that %Baz, %Foo *and* %Bar need
to be merged into one type.

In the system I'm envisioning, you'd get a bitcast. The same thing happens today when linking modules that contain things like:

a.ll:
declare void @foo(i32)
... uses of @foo ...

b.ll:
define void @foo(i8*) ...

Inserting casts would mean that a way to cast between different struct
types is needed (for instance, bitcast could be extended to allow
casting between struct types).

We already have this logic. The function itself is cast and instcombine attempts to resolve conflicts where reasonable. There is no IR extension needed for this.

-Chris

+1

With the named/unnamed/anonymous distinctions there seems to be two things going
on:
1. Giving an ID to a struct type.
2. Naming a type for readability.

Would the new system force anonymous structs to be printed inline at every
occurrence? The code might become hard to read with large anonymous types. Would
it make sense to have both ID and Name? If a type has a name it could be printed
using that. Perhaps there could be an extension to indicate that a type has an
ID (something like "%'foo" vs. "%foo"). When printing with -strip any type with
an ID (maybe a number) would be printed using the ID and a type without ID would
have to be printed inline.

My opinion about the syntax is that if there is a semantic difference between
two types it should be very clear in the syntax (maybe a "'" isn't clear
enough). In my mind, naming things in LLVM is used more for convenience or
readability, rather than specifying semantics.

- Jan

Several people have been proding me to write up my thoughts on how to fix the IR type system for LLVM 3.0. Here are some (fairly stream of conscious) thoughts on the matter:
http://nondot.org/sabre/LLVMNotes/TypeSystemRewrite.txt

Thanks!

Comments welcome!

Having fewer types makes the job of the passes easier. Right now we maintain a small number of types as we go, which is expensive. In the proposed change, would it still be legal to write a type merging pass? For example, lets say we have

   %Foo = type { i32 }
   %Bar = type { i32 }
   @a = unnamed_addr constant %Foo { i32 42 }
   @a = unnamed_addr constant %Bar { i32 42 }

The typemerge pass could turn this into

   @a = unnamed_addr constant {i32} { i32 42 }
   @b = unnamed_addr constant {i32} { i32 42 }

And constmerge will then merge them as it does today.

Another comment is that getting anonymous types would still be a bit expensive. Can we get away with just Named types (including Unnamed) and trusting typemerge to do its job?

-Chris

Cheers,
Rafael

With the named/unnamed/anonymous distinctions there seems to be two things going
on:
1. Giving an ID to a struct type.
2. Naming a type for readability.

Would the new system force anonymous structs to be printed inline at every
occurrence?

Yes, but I don't expect them to be commonly used. Things like complex numbers and simple tuples in a frontend might want to use them, but the frontend could always choose to name them as well.

I strongly considered just making all struct types have an "identity", thus eliminating anonymous types. This is effectively what we get in the .ll printer today, but it is a nuisance for simple pairs - like multi-return intrinsics. It still may not be worth the complexity to have them.

The code might become hard to read with large anonymous types. Would
it make sense to have both ID and Name? If a type has a name it could be printed
using that. Perhaps there could be an extension to indicate that a type has an
ID (something like "%'foo" vs. "%foo"). When printing with -strip any type with
an ID (maybe a number) would be printed using the ID and a type without ID would
have to be printed inline.

I think the problem is solved by having frontend authors not generate types that are hard to read.

My opinion about the syntax is that if there is a semantic difference between
two types it should be very clear in the syntax (maybe a "'" isn't clear
enough). In my mind, naming things in LLVM is used more for convenience or
readability, rather than specifying semantics.

Fair point,

-Chris

Comments welcome!

Having fewer types makes the job of the passes easier. Right now we
maintain a small number of types as we go, which is expensive. In the
proposed change, would it still be legal to write a type merging pass?
For example, lets say we have

  %Foo = type { i32 }
  %Bar = type { i32 }
  @a = unnamed_addr constant %Foo { i32 42 }
  @a = unnamed_addr constant %Bar { i32 42 }

The typemerge pass could turn this into

  @a = unnamed_addr constant {i32} { i32 42 }
  @b = unnamed_addr constant {i32} { i32 42 }

And constmerge will then merge them as it does today.

I don't see this as requiring a type merging pass. A better way to handle this (IMO) would be to generalize constant merging to handle bit identities, which would allow it to merge things like:

@a = unnamed constant i32 0
@b = unnamed constant float 0.0

This can be done today, and then the type system change won't affect constant merging.

Another comment is that getting anonymous types would still be a bit
expensive. Can we get away with just Named types (including Unnamed) and
trusting typemerge to do its job?

Getting an anonymous type should be the same cost as getting a struct type today: just look it up in a (redesigned) uniquing table. However, as I just mentioned to Jan, it might be better to just drop the concept of anonymous types entirely. This would simplify the model and is even closer to what we have today.

-Chris

I don't see this as requiring a type merging pass. A better way to
handle this (IMO) would be to generalize constant merging to handle
bit identities, which would allow it to merge things like:

@a = unnamed constant i32 0 @b = unnamed constant float 0.0

This can be done today, and then the type system change won't affect
constant merging.

It is true that it is better to have passes know that different types
can have the same bits, but a type merging pass would still be useful
for the passes that have not been updated or just for reducing the size
of a bitcode file.

The main question is if such a pass would be legal, or if we would
guarantee that Foo and Bar in the above examples would remain as two
different types.

-Chris

Thanks,
Rafael

Hi Chris

I have questions about new IR.

Named structs.
Unnamed structs (named structs with no names)
Anonymous structs (things like complex, which cannot be cyclic)

Given this, the rules above would be that "only non-anonymous structs can have
cycles" for example. Arrays, pointers and other non-namable types can be
considered to be anonymous as well.

When printing out .ll files, anonymous structs (which cannot have cycles) are
just printed inline, as in "{i32, i32}*". Named structs are printed as their
name ("%foo*"), and unnamed structs are printed with a number ("%42*"). This is
nice and compatible with existing .ll files.

When initializer type of Global Variable with StructType is different from type of
Global Variable, current front-end makes an temporary StructType for initializer
as unnamed struct ("%42"). In this case, Is temporary StructType for initializer
Unnamed sturct (named structs with no names)?

And will be key(StructValType) of StructTypes(map) the number of elements and
isPacked in current IR plus a name?
(if so, isPacked is likely to be redundant in some types.)

Thanks,
Jin-Gu Kang

This looks like a good extension to the type system. One thing that I expected to work (when I first tried using opaque types) was for the optimization passes and verification to operate correctly when using opaque types, while the type could be made concrete later on - after optimization. This turned out to be impossible, and in combination with the lack of naming for opaque types, I decided to avoid their use entirely. In particular, could you address whether this overhaul of the type system will address the following:

- The ability to create constants for named types without a defined structure
- Optimization involving PHI nodes for named types without a defined structure
- The ability to pass in named types to external function call arguments and return values

Basically, I'd like to be able to construct valid programs involving named types (without a defined structure), optimize/verify them, and then provide a structure for the type later on.

Andrew

This looks like a good extension to the type system. One thing that I
expected to work (when I first tried using opaque types) was for the
optimization passes and verification to operate correctly when using
opaque types, while the type could be made concrete later on - after
optimization. This turned out to be impossible, and in combination with
the lack of naming for opaque types, I decided to avoid their use
entirely. In particular, could you address whether this overhaul of the
type system will address the following:

- The ability to create constants for named types without a defined
structure

There is no change, this won't be allowed.

- Optimization involving PHI nodes for named types without a defined
structure

Not sure what you mean here, but I don't anticipate any change.

- The ability to pass in named types to external function call arguments
and return values

This is representable in the IR, but will blow up at codegen time, just like today.

Basically, I'd like to be able to construct valid programs involving
named types (without a defined structure), optimize/verify them, and
then provide a structure for the type later on.

I don't expect this series of changes to meaningfully impact this usage scenario, sorry!

-Chris

isPacked is still needed because it affects layout of the struct. Having a name (or not) doesn't impact layout. I don't know exactly what policy the frontend will use, but it would make sense for it to be an anonymous struct type.

-Chris

It would be legal. Type names have no semantic effect on the program. They aren't tied to TBAA or anything like that.

-Chris

- Optimization involving PHI nodes for named types without a defined
structure

Not sure what you mean here, but I don't anticipate any change.

I thinking along the lines of conversion to select instructions, and elimination of PHIs with identical incoming values - SSA-based transformations that are currently valid for concrete types.

- The ability to pass in named types to external function call arguments
and return values

This is representable in the IR, but will blow up at codegen time, just like today.

I'm not sure what you mean. Code generation requires concrete types, so it would still be necessary to specify the structure for all types before code generation time.

Last time I checked, it's not currently possible to pass opaque types to external functions and have a valid IR - this currently fails in the verifier. It is possible to pass pointers to opaque types, though.

Basically, I'd like to be able to construct valid programs involving
named types (without a defined structure), optimize/verify them, and
then provide a structure for the type later on.

I don't expect this series of changes to meaningfully impact this usage scenario, sorry!

-Chris

My two cents are that this is a useful usage scenario to consider if any major changes are being made to the type system. A ton of meaningful optimizations can be performed without requiring concrete types, and it would provide a nice interface for type system extension (without making core changes to the IR).

Andrew

Sure, I'm just saying that it is orthogonal to the set of problems that I'm looking to solve. I'm not opposed to someone else making this happen.

-Chris