Suggestion: Support union types in IR

Talin wrote:

Chris Lattner wrote:

  

I've been thinking about how to represent unions or "disjoint types" in LLVM IR. At the moment, the only way I know to achieve this right now is to create a struct that is as large as the largest type in the union and then bitcast it to access the fields contained within. However, that requires that the frontend know the sizes of all of the various low-level types (the "size_t" problem, which has been discussed before), otherwise you get problems trying to mix pointer and non-pointer types.
    

That's an interesting point. As others have pointed out, we've resisted having a union type because it isn't strictly needed for the current set of front-ends. If a front-end is trying to generate target-independent IR though, I can see the utility. The "gep trick" won't work for type generation.

It seems to me that adding a union type to the IR would be a logical extension to the language. The syntax for declaring a union would be similar to that of declaring a struct. To access a union member, you would use GetElementPointer, just as if it were a struct. The only difference is that in this case, the GEP doesn't actually modify the address, it merely returns the input argument as a different type. In all other ways, unions would be treated like structs, except that the size of the union would always be the size of the largest member, and all of the fields within the union would be located located at relative offset zero.
    

Yes, your proposal makes sense, for syntax, I'd suggest: u{ i32, float}

Unions could of course be combined with other types:

   {{int|float}, bool} *
   n = getelementptr i32 0, i32 0, i32 1

So in the above example, the GEP returns a pointer to the float field.
    

I don't have a specific problem with adding this. The cost of doing so is that it adds (a small amount of) complexity to a lot of places that walk the type graphs. The only pass that I predict will be difficult to update to handle this is the BasicAA pass, which reasons about symbolic (not concrete) offsets and should return mustalias in the appropriate cases. Also, to validate this, I think llvm-gcc should start generating this for C unions where possible.

If you're interested in implementing this and seeing all the details of the implementation through to the end, I don't see significant problems. I think adding a simple union type would make more sense than adding first-class support for a *discriminated* union.
  

So, I spent a little bit of time looking at how to code this. Rather than defining a brand-new LLVM type, I think the best approach is to generalize the existing StructType. We can change the "isPacked" flag (which is stored in the SubclassData of the type) to a 3-valued enum called StructLayout, which has 3 members: UnpackedLayout, PackedLayout, and UnionLayout.

After that, it's mostly a case of hunting down everywhere that calls "isPacked", and adding the appropriate case. :slight_smile:

Hi Talin, thanks for looking in to this.

I think there's already a lot of code that makes the assumption that the different members of a StructType represent distinct storage. Why is it easier to co-opt StructType than to create a new 'UnionType' under CompositeType? What's the tradeoff?

Nick

Nick Lewycky wrote:

Hi Talin, thanks for looking in to this.

I think there's already a lot of code that makes the assumption that the different members of a StructType represent distinct storage. Why is it easier to co-opt StructType than to create a new 'UnionType' under CompositeType? What's the tradeoff?
  

Well, from a pure lines-of-code perspective, the struct and union types share about 90% of their implementation, and it seemed wrong to duplicate that code. I thought about factoring out a common base class, however that seemed more disruptive to the code base than I really wanted to deal with.

However, I haven't really looked into the parts of LLVM that actually do the analysis. (I'm mostly a front-end guy.) So I don't yet understand how much trouble such a change would cause.

(So
that the frontend doesn't need to know how big a pointer is and can
generate the same IR that works on both 32-bit and 64-bit platforms.)

Why not just use a pointer, such as i8*?

Suppose I have an STL-like container that has a 'begin' and 'end'
pointer. Now I want to find the size() of the container. Since you
cannot subtract pointers in LLVM IR, you have to cast them to an integer
type first. But what integer type do you cast them to? I suppose you
could simply always cast them to i64, and hope that the backend will
generate efficient code for the subtraction, but I have no way of
knowing this.

What do you intend to do with this size? At some point you have to do an operation that depends on the size. The problem with adding an intptr_t is that at some point you have to convert it to another datatype, and you don't know whether it will be a zext/trunc/bitcast etc.

Now, I'm going to anticipate what I think will be your next argument,
which is that at some point I must know the size of the result since I
am assigning the result of size() to some interger variable eventually.

yes :slight_smile:

Which is true, however, if the size of that eventual variable is smaller
than a pointer, then I want to check it for overflow before I do the
assignment. I don't want to just do a blind bitcast and have the top
bits be lopped off.

But you don't know if it will be smaller or larger.

-Chris

Nick Lewycky wrote:

Hi Talin, thanks for looking in to this.

I think there's already a lot of code that makes the assumption that the different members of a StructType represent distinct storage. Why is it easier to co-opt StructType than to create a new 'UnionType' under CompositeType? What's the tradeoff?

So, I'm not sure what path to take now, or how to proceed on this. Anyone have an opinion as to whether unions should be a brand new type, or a variation of struct? The union class will share a lot of the same internal stuff as struct, such as the methods for serializing and managing the memory for the member list.

In the mean time, I've decided to use an evil hack to get around the lack of proper union support. What I do is to examine all of the types in the union and estimate which one will be the largest. However, since my compiler is platform-independent, and doesn't know how big a pointer is, my calculation for "largest" is fairly complex - I calculate the size of all of the union members based on the assumption that pointers are 32-bits, and then filter out all members which are smaller than some other member in the set. I then do the same calculation but this time assume that pointers are 64 bits. Now I have two "largest" sets, I then intersect them to see if they have any entries in common. If they do, I pick one, and create a struct containing that type. This struct is guaranteed to be large enough to hold any union member on either a 32-bit or 64-bit platform. If the two sets don't have any entries in common, then I just fail and print a message saying wait until proper union support is done :slight_smile:

This hack at least gets me far enough so that in my language I can now compile and run code that looks like this:

  def testSimpleUnionType() {
    var x:int or float;
    x = 1;
    Debug.assertTrue(x isa int);
    Debug.assertFalse(x isa float);
    Debug.assertFalse(x isa String);
    x = 1.0;
    Debug.assertTrue(x isa float);
    Debug.assertFalse(x isa int);
    Debug.assertFalse(x isa String);
  }

-- Talin

You may want to feed it back into GEP. If size() for the container is
the first step in calculating a pivot point for qsort, then as long as
GEP takes in intptr_t you are set. In fact, if intptr_t existed, it
would be all a GEP or malloc had to take and those instructions
wouldn't need the current bi-version approach (nor any other thing
that takes a size, like the mem* intrinsics). I would go so far as to
say that ptr->int and int->ptr cases should only be possible to an
intptr type, and then the target knowledge which currently is embedded
in the ptr cast is embedded in the int cast if you need to cast to a
known size.

Andrew