[PATCH] - Union types, attempt 2

This patch adds a UnionType to DerivedTypes.h. It also adds code to the bitcode reader / writer and the assembly parser for the new type, as well as a tiny .ll test file in test/Assembler. It does not contain any code related to code generation or type layout - I wanted to see if this much was acceptable before I proceeded any further.

Unlike my previous patch, in which the Union type was implemented as a packing option to type Struct (thereby re-using the machinery of Struct), this patch defines Union as a completely separate type from Struct.

I was a little uncertain as to how to write the tests. I’d particularly like to write tests for the bitcode reader/writer stuff, but I am not sure how to do that.

uniontype.patch (22.6 KB)

Anyone want to take a look at this and tell me if I am on a reasonable path?

This patch adds a UnionType to DerivedTypes.h.

Cool. When proposing an IR extension, it is usually best to start with a LangRef.html patch so that we can discuss the semantics of the extension. Please do write this before you get much farther. I assume that you want unions usable in the same situations as a struct. However, how do “constant unions” work? How do I initialize a global variable whose type is “union of float and i32” for example?

It also adds code to the bitcode reader / writer and the assembly parser for the new type, as well as a tiny .ll test file in test/Assembler. It does not contain any code related to code generation or type layout - I wanted to see if this much was acceptable before I proceeded any further.

The .ll file isn’t included in your patch, but I see that you chose a syntax of ‘union { i32, float }’ which seems very reasonable.

Unlike my previous patch, in which the Union type was implemented as a packing option to type Struct (thereby re-using the machinery of Struct), this patch defines Union as a completely separate type from Struct.

I think this approach makes sense. It means that things like TargetData StructLayout won’t work for unions, but you don’t need them since all elements are at offset 0.

I was a little uncertain as to how to write the tests. I’d particularly like to write tests for the bitcode reader/writer stuff, but I am not sure how to do that.

A reasonable example are tests like test/Feature/newcasts.ll

Here are some thoughts on your patch:

+class UnionType : public CompositeType {

  • /// UnionType::get - Create an empty union type.
  • ///
  • static UnionType *get(LLVMContext &Context) {
  • return get(Context, std::vector<const Type*>());
  • }

I don’t think that an empty union is going to be important enough to add a special accessor for it. Is an empty union ever a useful thing to do? If you completely disallow them from IR, it would end up simplifying some things. We don’t allow empty vectors, and it seems that an empty union has exactly the same semantics as an empty struct, so having both empty structs and empty unions doesn’t seem necessary.

  • static UnionType *get(LLVMContext &Context,
  • const std::vector<const Type*> &Params);

Since this is new code, please have the constructor method take a ‘const Typeconst Elements, unsigned NumElements’ instead of requiring the caller to make an std::vector. This allows use of SmallVector etc. It is desirable to do this for all the other type classes in DerivedTypes.h, but we haven’t gotten around to doing it yet.

  • /// UnionType::get - This static method is a convenience method for
  • /// creating union types by specifying the elements as arguments.
  • /// Note that this method always returns a non-packed struct. To get
  • /// an empty struct, pass NULL, NULL.
  • static UnionType *get(LLVMContext &Context,
  • const Type *type, …) END_WITH_NULL;

Please update the comments. Also, if you disallow empty unions here, you don’t need to pass a context.

+++ include/llvm/Type.h (working copy)
@@ -86,6 +86,7 @@
PointerTyID, ///< 12: Pointers
OpaqueTyID, ///< 13: Opaque: type with unknown structure
VectorTyID, ///< 14: SIMD ‘packed’ format, or other vector type

  • UnionTyID, ///< 15: Unions

Please put this up next to Struct for simplicity, the numbering here doesn’t need to be stable. The numbering in llvm-c/Core.h does need to be stable though.

+bool UnionType::indexValid(const Value *V) const {

  • // Union indexes require 32-bit integer constants.
  • if (V->getType() == Type::getInt32Ty(V->getContext()))

Please use V->getType()->isInteger(32) which is probably new since you started your patch.

+UnionType::UnionType(LLVMContext &C, const std::vector<const Type*> &Types)

  • : CompositeType(C, UnionTyID) {
  • ContainedTys = reinterpret_cast<PATypeHandle*>(this + 1);
  • NumContainedTys = Types.size();
  • bool isAbstract = false;
  • for (unsigned i = 0; i < Types.size(); ++i) {

No need to evaluate Types.size() every time through the loop.

+bool LLParser::ParseUnionType(PATypeHolder &Result) {

  • if (!EatIfPresent(lltok::lbrace)) {
  • return Error(EltTyLoc, “‘{’ expected after ‘union’”);
  • }

Please use:
if (ParseToken(lltok::lbrace, “‘{’ expected after ‘union’”)) return true;

  • EltTyLoc = Lex.getLoc();
  • if (ParseTypeRec(Result)) return true;
  • ParamsList.push_back(Result);

Quick question - should unions enforce that all member types are unique? I realize that a union of { i32, i32 } doesn’t make sense, but should the code actually forbid this?

As far as constants go, as long as the initializer is an exact match for one of the member types, it should be no problem.

Quick question - should unions enforce that all member types are unique? I realize that a union of { i32, i32 } doesn't make sense, but should the code actually forbid this?

Either way works for me.

As far as constants go, as long as the initializer is an exact match for one of the member types, it should be no problem.

Right, please propose a syntax and a class to use (ConstantUnion?) for it,

-Chris

I’m working on a new version of the patch.

Another thing I wanted to ask about - do you prefer to have one giant patch that has everything, or a series of incremental patches? I can see advantages either way.

Normally I would want to do this as a series of incremental patches, however this is a rather large project and it may take me quite a while before it’s completely done. I don’t doubt that I will need some assistance when it comes to the trickier parts (like the optimization aspects you mentioned.) So there’s a risk involved in submitting the first one or two patches, because the final patch might not be ready in time for the next release.

On the other hand, it will be a lot easier for others to assist if we go ahead and submit the initial work.

I'm working on a new version of the patch.

Another thing I wanted to ask about - do you prefer to have one giant patch that has everything, or a series of incremental patches? I can see advantages either way.

A series of incremental patches is strongly preferred, starting with LangRef.html.

Normally I would want to do this as a series of incremental patches, however this is a rather large project and it may take me quite a while before it's completely done. I don't doubt that I will need some assistance when it comes to the trickier parts (like the optimization aspects you mentioned.) So there's a risk involved in submitting the first one or two patches, because the final patch might not be ready in time for the next release.

On the other hand, it will be a lot easier for others to assist if we go ahead and submit the initial work.

No problem, just submit it as you go. When the langref piece goes in, just say in it that this is an experimental feature in development. Thanks Talin,

-Chris

Here is the LangRef part of the patch.

unionref.patch (8.14 KB)

[...]

This wording is somewhat misleading; memory in LLVM has no types.
How about:

"A union type describes an object with size and alignment suitable for
an object of any one of a given set of types."

Also, is it really useful to support
insertvalue/extractvalue/getelementptr on unions? The benefit of unions
that I'm aware of is it allows target-independent IR to work with
appropriately sized and aligned memory. This doesn't require any special
support for accessing union members; for example:

  %p = alloca union { i32, double }
  %q = bitcast union { i32, double }* %p to double*
  store i32 2.0, double* %q

Would this be a reasonable approach?

Dan

Here is the LangRef part of the patch.

+

The union type is used to represent a set of possible data types which can

  • exist at a given location in memory (also known as an “untagged”
  • union).
    […]

This wording is somewhat misleading; memory in LLVM has no types.
How about:

“A union type describes an object with size and alignment suitable for
an object of any one of a given set of types.”

OK

Also, is it really useful to support
insertvalue/extractvalue/getelementptr on unions? The benefit of unions
that I’m aware of is it allows target-independent IR to work with
appropriately sized and aligned memory. This doesn’t require any special
support for accessing union members; for example:

%p = alloca union { i32, double }
%q = bitcast union { i32, double }* %p to double*
store i32 2.0, double* %q

Would this be a reasonable approach?

It depends on whether or not unions can be passed around as SSA values or not. I can think of situations where you would want to.

In particular, GEP is useful because you can avoid the bitcast above - GEP to element 0 if you want an int (in the example above), or to element 1 if you want a double.

Also, I’m thinking that insertvalue might be the best way to construct a constant union. Right now there’s a bit of a problem in that the data type of a constant must match exactly the declared type; However in the case of a union what we want is for the data type of the initializer to exactly match the type of one of the union members. I thought this would be relatively easy to do, but it’s a little trickier than I realized.

I’m still working on the next patch, it’s going somewhat slowly. I wanted to create a unit test that actually created a union, and in order to do that I had to implement constant unions. And rather than creating a special syntax for constructing a union, I decided that it was simplest to implement the insertvalue instruction for a constant union expression:

@foo = constant union { i32, float } insertvalue union { i32, float } undef, i32 4, 0

What this says is to start with an undef, and then insert the value ‘4’ into the integer field (the zeroth field) of the union.

The reason for doing it this way is that to construct a union, you really need 4 pieces of information: The type of the union, the type and value of the member to be initialized, and the index of which member is being initialized. Originally I thought about having the last be detected automatically by what type of initializer was used:

@foo = constant union { i32, float } i32 4

However, from a syntactical standpoint what you get is two types in a row - “union { i32, float }” followed by “i32”. That is completely unlike any other IR syntax and doesn’t fit well into the parser. Using insertvalue as an initializer has the advantage that it’s parameters supply all of information we need. The disadvantage is that you have to type the union type signature twice, but I doubt that will be a major issue since IR isn’t meant to be typed by hand anyway.

I can think of another benefit of the insertvalue/extractvalue/etc
approach, which is that then LLVM checks your types for you. In other
words, it makes sure you don't put an i64 into a union of i32 and
double, whereas you can always bitcast an i64 to that union.

Reid

Does requiring the index mean that uniquing the union type will have
to re-write many of the corresponding insertvalue calls?

For instance, how would this round-trip?

    @foo = constant union { float, i32 } insertvalue union { i32,
float } undef, i32 4, 0
    @bar = constant union { i32, float } insertvalue union { float,
i32 } undef, i32 4, 1

I'm very glad to see a non-bitcast method of using unions, BTW.

2010/1/14 Talin <viridia@gmail.com>:

The reason for doing it this way is that to construct a union, you really
need 4 pieces of information: The type of the union, the type and value of
the member to be initialized, and the index of which member is being
initialized.

Does requiring the index mean that uniquing the union type will have
to re-write many of the corresponding insertvalue calls?

For instance, how would this round-trip?

@foo = constant union { float, i32 } insertvalue union { i32,

float } undef, i32 4, 0

@bar = constant union { i32, float } insertvalue union { float,
i32 } undef, i32 4, 1

Well, the fact that union members have to be indexed by number means that the ordering has to be part of the type - so even though type-theoretically union { i32, float } is the same as union { float, i32 }, in my implementation they are distinct types. However, from the standpoint of a frontend, this is not a great concern, because the frontend will most likely sort the list of types before constructing the IR type. By always putting the types in a canonical order, regardless of the order that they appear in the source code, you can ensure that unions of equal types are always compatible. In other words, you can treat the members like an ordered set rather than like a list.

Does that mean that my example above is ill-formed, since the
insertvalue gives a different type than the constant wants?

Yes.

Talin schrieb:

Well, the fact that union members have to be indexed by number means that the ordering has to be part of the type - so even though type-theoretically union { i32, float } is the same as union { float, i32 }, in my implementation they are distinct types. However, from the standpoint of a frontend, this is not a great concern, because the frontend will most likely sort the list of types before constructing the IR type.

Hm... it's placing a burden on the frontend developer.

More importantly, it's something that the fronend developer must not forget to do, so you better make sure this is documented in capital letters in a place where the frontend developer is likely to look when preparing code generation.

Most importantly, however, this will create a lot of hassles when making code interoperable between compilers: Compiler writers need to agree on a language-independent canonical ordering.
That said, if the ordering is canonical, it could be established at the IR level. E.g. by ordering alphabetically.

When coding, please consider that many languages establish assignment compatibility between union types. E.g. a union {i32, float} value could be assigned to a name that's typed as a union {i32, i64, float}.
This probably means the need for conversion operators, and it definitely means that indexes aren't meaningful by themselves, only in conjunction with their union type.

> By always putting the types in a canonical order, regardless of

the order that they appear in the source code, you can ensure that unions of equal types are always compatible. In other words, you can treat the members like an ordered set rather than like a list.

Yes, that's closer to the frontend semantics: the variants of a union type don't have any natural ordering, so list semantics could cause problems.

Regards,
Jo

It seems to me like that's similar to the ptrtoint and such
instructions that change the type and not necessarily the value, so
perhaps the main union operations should be:

    %foo = elementtounion i32 4 to union { i32, float }
    %bar = uniontoelement union { i32, float } %foo to i32

Conceptually I dislike the insertvalue, since the point of an insert
is to keep some other part of the value intact, something not needed
with unions.

I think it is useful to support insert/extract on unions just for orthogonality alone. Stuff that works on structs should generally work on unions too.

-Chris

I'm still working on the next patch, it's going somewhat slowly. I wanted to create a unit test that actually created a union, and in order to do that I had to implement constant unions. And rather than creating a special syntax for constructing a union, I decided that it was simplest to implement the insertvalue instruction for a constant union expression:

    @foo = constant union { i32, float } insertvalue union { i32, float } undef, i32 4, 0

What this says is to start with an undef, and then insert the value '4' into the integer field (the zeroth field) of the union.

Insertvalue constant exprs should work on these, but that should fall out from insertvalue just working on unions. However:

The reason for doing it this way is that to construct a union, you really need 4 pieces of information: The type of the union, the type and value of the member to be initialized, and the index of which member is being initialized. Originally I thought about having the last be detected automatically by what type of initializer was used:

    @foo = constant union { i32, float } i32 4

I think we really do need a "ConstantUnion" class. How about syntax like this:

     @foo = constant union { i32, float, double, i32*, i32 } { i32 4 }

That seems simple and unambiguous, and analogous to structs.

-Chris