Suggestion: Support union types in IR

I’ve been thinking about how to represent unions or “disjoint types” in LLVM IR. At the moment, the only way I know to achieve this is to create a struct that is as large as the largest type in the union and then bitcast it to access the fields contained within. However, that requires that the frontend know the sizes of all of the various low-level types (the “size_t” problem, which has been discussed before), otherwise you get problems trying to mix pointer and non-pointer types.

It seems to me that adding a union type to the IR would be a logical extension to the language. The syntax for declaring a union would be similar to that of declaring a struct. To access a union member, you would use GetElementPtr, just as if it were a struct. The only difference is that in this case the GEP doesn’t actually modify the address; it merely returns the input argument as a different type. In all other ways, unions would be treated like structs, except that the size of the union would always be the size of the largest member, and all of the fields within the union would be located at relative offset zero.

One issue would be how to represent the union in text form. One alternative is to use the same syntax for structs, except replace the comma with a vertical bar character, as in this example:

{int|float}

The drawback to this approach is that you can’t tell whether it is a struct or union until after you have parsed the first argument. If that turns out to be too inconvenient for the parser, then some unique two-character sequence will have to be used, such as:

{|int|float|}

Another idea is to use struct syntax with a keyword prefix:

union {int, float}

The specific details of syntax I will leave to others.

Unions could of course be combined with other types:

   {{int|float}, bool} *
   n = getelementptr i32 0, i32 0, i32 1

So in the above example, the GEP returns a pointer to the float field.
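Spelled out with full operands (purely illustrative, since none of this union syntax exists today, and using i1 for the bool), the example might read:

  %T = type { { i32 | float }, i1 }      ; hypothetical union-in-struct type

  define float* @float_member(%T* %p) {
    ; indices: deref, field 0 (the union), union member 1 (the float)
    %n = getelementptr %T* %p, i32 0, i32 0, i32 1
    ; under the proposal this is the same address as the union itself, just typed float*
    ret float* %n
  }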

I've been thinking about how to represent unions or "disjoint types" in LLVM
IR.

This has been rejected in the past due to insufficient utility:
bitcasts have worked perfectly fine in the past...

At the moment, the only way I know to achieve this is to
create a struct that is as large as the largest type in the union and then
bitcast it to access the fields contained within. However, that requires
that the frontend know the sizes of all of the various low-level types (the
"size_t" problem, which has been discussed before), otherwise you get
problems trying to mix pointer and non-pointer types.

I thought we had a reasonable solution to the sizeof problem: the
getelementptr trick. Why isn't that usable here?

-Eli
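For readers who haven't seen it, the "getelementptr trick" Eli mentions computes a type's size without target knowledge by indexing one element past a null pointer; a minimal sketch in the IR syntax of the time:

  %T = type { i8*, i32 }

  define i64 @sizeof_T() {
    ; the byte offset of element 1 from a null %T* is exactly sizeof(%T)
    %end  = getelementptr %T* null, i32 1
    %size = ptrtoint %T* %end to i64
    ret i64 %size
  }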

Talin wrote:

I've been thinking about how to represent unions or "disjoint types" in LLVM IR. At the moment, the only way I know to achieve this is to create a struct that is as large as the largest type in the union and then bitcast it to access the fields contained within. However, that requires that the frontend know the sizes of all of the various low-level types (the "size_t" problem, which has been discussed before), otherwise you get problems trying to mix pointer and non-pointer types.

It has to come down to a fixed size at some point, using target-specific knowledge. Is there any reason you couldn't use 'opaque' here and resolve it once you have the necessary information?

Nick

I think very few people are still using C-style unions so there is probably
little desire to put them into LLVM itself. Class hierarchies and algebraic
datatypes are much more common and use a different representation (that LLVM
can already handle with bitcasts). Specifically, as a pointer to one of
several struct types that can vary in size.

I am considering a design along the following lines. All algebraic datatypes
are stored as a pointer to an int tag:

  i32 *

The tag conveys the true type of the value and, indeed, it may even be a
pointer to comprehensive run-time type information. So you load the tag and
examine it in order to determine what type to cast to. Then you bitcast to
another type if you need to load any arguments (that appear after the tag).
For example, an int option type might cast as follows:

  None (tag=0) -> {i32} *
  Some n (tag=1) -> {i32, i32} *

Assuming LLVM keeps the first i32 field of any struct in the same place, this
provides the required uniformity for algebraic datatypes and will also scale
to full class hierarchies.
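A minimal sketch of that scheme for the int option example (the function and its names are made up for illustration):

  define i32 @get_or_default(i32* %v, i32 %default) {
  entry:
    %tag = load i32* %v                        ; 0 = None, 1 = Some
    %is_some = icmp eq i32 %tag, 1
    br i1 %is_some, label %some, label %none
  some:
    %p  = bitcast i32* %v to {i32, i32}*       ; Some case: { tag, payload }
    %np = getelementptr {i32, i32}* %p, i32 0, i32 1
    %n  = load i32* %np
    ret i32 %n
  none:
    ret i32 %default
  }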

I assume the C++ front-end for Clang does something similar to implement
classes?

I've been thinking about how to represent unions or "disjoint types" in LLVM IR. At the moment, the only way I know to achieve this is to create a struct that is as large as the largest type in the union and then bitcast it to access the fields contained within. However, that requires that the frontend know the sizes of all of the various low-level types (the "size_t" problem, which has been discussed before), otherwise you get problems trying to mix pointer and non-pointer types.

That's an interesting point. As others have pointed out, we've resisted having a union type because it isn't strictly needed for the current set of front-ends. If a front-end is trying to generate target-independent IR though, I can see the utility. The "gep trick" won't work for type generation.

It seems to me that adding a union type to the IR would be a logical extension to the language. The syntax for declaring a union would be similar to that of declaring a struct. To access a union member, you would use GetElementPtr, just as if it were a struct. The only difference is that in this case the GEP doesn't actually modify the address; it merely returns the input argument as a different type. In all other ways, unions would be treated like structs, except that the size of the union would always be the size of the largest member, and all of the fields within the union would be located at relative offset zero.

Yes, your proposal makes sense. For syntax, I'd suggest: u{ i32, float }

Unions could of course be combined with other types:

   {{int|float}, bool} *
   n = getelementptr i32 0, i32 0, i32 1

So in the above example, the GEP returns a pointer to the float field.

I don't have a specific problem with adding this. The cost of doing so is that it adds (a small amount of) complexity to a lot of places that walk the type graphs. The only pass that I predict will be difficult to update to handle this is the BasicAA pass, which reasons about symbolic (not concrete) offsets and should return mustalias in the appropriate cases. Also, to validate this, I think llvm-gcc should start generating this for C unions where possible.
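For example, with the proposed type the two GEPs below address the same memory even though they name different members, which is exactly the case BasicAA would have to recognize (hypothetical syntax):

  %U = type { i32 | float }      ; proposed union syntax, not valid today

  define void @clobber(%U* %u) {
    %pi = getelementptr %U* %u, i32 0, i32 0   ; i32*,   offset 0
    %pf = getelementptr %U* %u, i32 0, i32 1   ; float*, offset 0 as well
    store i32 42, i32* %pi
    store float 2.5, float* %pf                ; must be seen as clobbering the i32 store
    ret void
  }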

If you're interested in implementing this and seeing all the details of the implementation through to the end, I don't see significant problems. I think adding a simple union type would make more sense than adding first-class support for a *discriminated* union.

-Chris

Responding to everyone in one post:

Eli Friedman wrote:

At the moment, the only way I know to achieve this is to
create a struct that is as large as the largest type in the union and then
bitcast it to access the fields contained within. However, that requires
that the frontend know the sizes of all of the various low-level types (the
"size_t" problem, which has been discussed before), otherwise you get
problems trying to mix pointer and non-pointer types.

I thought we had a reasonable solution to the sizeof problem: the
getelementptr trick. Why isn't that usable here?
  

As Chris points out, that won't work for type declarations.

Nick Lewycky wrote:

It has to come down to a fixed size at some point, using target-specific knowledge. Is there any reason you couldn't use 'opaque' here and resolve it once you have the necessary information?
  

Perhaps I've never quite understood the role of bitcode files. Assume I have a compiler that generates .bc files, and a linker which combines them into a runnable program. At which point is the target machine chosen, in the compiler or in the linker? The assumption that I have made up to this point is that the compiler produces .bc files which are target-neutral, and the linker then generates a target-specific program. (I realize that other design choices are possible.)

In order to do what you suggest, the linker needs to know all of the possible types that the union type can be. That means that the compiler has to transmit this knowledge to the linker in some way. Since the only communication between the compiler and the linker is bitcode (at least, in my current design), the types need to either be represented in-band (that is, representable within LLVM IR) or out-of-band (via some other mechanism.) It would seem to be cleaner if the information could be represented in-band.

Jon Harrop wrote:

I think very few people are still using C-style unions so there is probably little desire to put them into LLVM itself. Class hierarchies and algebraic datatypes are much more common and use a different representation (that LLVM can already handle with bitcasts). Specifically, as a pointer to one of several struct types that can vary in size.

I am considering a design along the following lines. All algebraic datatypes are stored as a pointer to an int tag:

  i32 *
  

Exactly. I'm not especially interested in C-style unions, I'm interested in discriminated unions. But the actual discriminator field is easily represented in LLVM IR already, so there's no need to extend the IR to support them. That's why I am only asking for C-style union support - it's a lower-level primitive that you can use to build discriminated unions on top of.

The problem with varying-size structs is that it would be nice to be able to pass them as function parameters and function return values using LLVM's support of structs as first-class data types. That requires that the code that handles argument marshalling know the size of the largest possible type that can be passed.

The tag conveys the true type of the value and, indeed, it may even be a pointer to comprehensive run-time type information. So you load the tag and examine it in order to determine what type to cast to. Then you bitcast to another type if you need to load any arguments (that appear after the tag). For example, an int option type might cast as follows:

  None (tag=0) -> {i32} *
  Some n (tag=1) -> {i32, i32} *
  

Since the language I am working on is statically typed, I was going to select a type discrimination strategy based on the actual types involved. For example, in the case of a union of reference types, there's no need for a discriminator field, since reference types already carry around their own type information. On the other hand, if you have a union of two enum types of limited range, the high bits of the enum would be available. In cases such as int or float where all the bits were used, then a separate discriminator field would be needed.
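Rough sketches of those three strategies (all names are hypothetical; the layouts are only meant to illustrate the idea):

  ; 1. Union of reference types: store just the pointer; the pointee carries
  ;    its own type information, so no separate discriminator is needed.
  %ref_union = type { i8* }

  ; 2. Two enums of limited range: pack the discriminator into spare high bits.
  define i32 @which_case(i32 %packed) {
    %disc = lshr i32 %packed, 31     ; top bit selects which enum this is
    ret i32 %disc
  }

  ; 3. i32 or float uses every bit, so a separate tag field is required.
  %int_or_float = type { i32, i32 }  ; { tag, raw 32-bit payload }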

Chris Lattner wrote:

I don't have a specific problem with adding this. The cost of doing so is that it adds (a small amount of) complexity to a lot of places that walk the type graphs. The only pass that I predict will be difficult to update to handle this is the BasicAA pass, which reasons about symbolic (not concrete) offsets and should return mustalias in the appropriate cases. Also, to validate this, I think llvm-gcc should start generating this for C unions where possible.

If you're interested in implementing this and seeing all the details of the implementation through to the end, I don't see significant problems. I think adding a simple union type would make more sense than adding first-class support for a *discriminated* union.
  

Well, we'll see. Parsers and type hierarchies I grok. Backends and code generation, not so much. It will be a while before I am forced to solve this problem one way or another, but eventually I will.

-- Talin

Jon Harrop wrote:
> I think very few people are still using C-style unions so there is
> probably little desire to put them into LLVM itself. Class hierarchies
> and algebraic datatypes are much more common and use a different
> representation (that LLVM can already handle with bitcasts).
> Specifically, as a pointer to one of several struct types that can vary
> in size.
>
> I am considering a design along the following lines. All algebraic
> datatypes are stored as a pointer to an int tag:
>
> i32 *

Exactly. I'm not especially interested in C-style unions, I'm interested
in discriminated unions. But the actual discriminator field is easily
represented in LLVM IR already, so there's no need to extend the IR to
support them. That's why I am only asking for C-style union support -
it's a lower-level primitive that you can use to build discriminated
unions on top of.

I don't think you would want to build discriminated unions on top of C-style
unions though.

The problem with varying-size structs is that it would be nice to be
able to pass them as function parameters and function return values
using LLVM's support of structs as first-class data types.

As I understand it, LLVM does not provide first-class structs and attempting
to return arbitrary structs from functions will randomly break. I believe you
need to implement that yourself in whatever way your target language
requires.

That requires that the code that handles argument marshalling know the size
of the largest possible type that can be passed.

Pass a pointer and that size will always be 1 word. You can still stack-allocate
the space if you must, and LLVM's optimization passes will do their best to
completely remove that overhead when possible.

> The tag conveys the true type of the value and, indeed, it may even be a
> pointer to comprehensive run-time type information. So you load the tag
> and examine it in order to determine what type to cast to. Then you
> bitcast to another type if you need to load any arguments (that appear
> after the tag). For example, an int option type might cast as follows:
>
> None (tag=0) -> {i32} *
> Some n (tag=1) -> {i32, i32} *

Since the language I am working on is statically typed, I was going to
select a type discrimination strategy based on the actual types
involved. For example, in the case of a union of reference types,
there's no need for a discriminator field,

Only if they are all different reference types. Otherwise you might have the
situation:

  type t = Negate of t | Factorial of t

Where both constructors carry an argument of the same type (t) and you do
still need a tag to distinguish between the cases (Negate and Factorial) in
the union.
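In IR terms the two constructors would share one layout and differ only in the tag value (a sketch; the type name is hypothetical):

  %t = type { i32, %t* }     ; same layout for Negate and Factorial
  ; Negate x    is stored with tag 0: { i32 0, x }
  ; Factorial x is stored with tag 1: { i32 1, x }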

since reference types already
carry around their own type information. On the other hand, if you have
a union of two enum types of limited range, the high bits of the enum
would be available. In cases such as int or float where all the bits
were used, then a separate discriminator field would be needed.

There is certainly some scope here and I have been thinking about similar
things. I would like both run-time type information and efficient pattern
matching over algebraic datatypes.

The run-time type information can be a field from every reference type that
points to its run-time type. When pattern matching, you usually want to
bisect the number of cases covered in the union to perform a balanced search
and achieve the best performance. That usually assumes the presence of
consecutive integers as tags so the use of effectively-random pointers would
throw that off a bit, especially if the run-time type information is
subjected to a moving garbage collector. Alternatively, I could put a
redundant tag in each value on the heap...

Exactly. I'm not especially interested in C-style unions, I'm interested
in discriminated unions. But the actual discriminator field is easily
represented in LLVM IR already, so there's no need to extend the IR to
support them. That's why I am only asking for C-style union support -
it's a lower-level primitive that you can use to build discriminated
unions on top of.

I don't think you would want to build discriminated unions on top of C-style
unions though.

Why?

The problem with varying-size structs is that it would be nice to be
able to pass them as function parameters and function return values
using LLVM's support of structs as first-class data types.

As I understand it, LLVM does not provide first-class structs and attempting
to return arbitrary structs from functions will randomly break. I believe you
need to implement that yourself in whatever way your target language
requires.

This is a current implementation deficiency, not a design limitation. We welcome patches to fix this.

-Chris

Nick Lewycky wrote:

It has to come down to a fixed size at some point, using target-specific
knowledge. Is there any reason you couldn't use 'opaque' here and
resolve it once you have the necessary information?

Perhaps I've never quite understood the role of bitcode files. Assume I
have a compiler that generates .bc files, and a linker which combines
them into a runnable program. At which point is the target machine
chosen, in the compiler or in the linker? The assumption that I have
made up to this point is that the compiler produces .bc files which are
target-neutral, and the linker then generates a target-specific program.
(I realize that other design choices are possible.)

Well there are two answers to this. Ideally, LLVM would work as you say. Unfortunately some of the current implementation has been tainted by LLVM's primary development being guided by the needs of C compilers. C compilers inherently need the front-end to encode target-specific information in the IR.

However, I really do want to serve portable languages better. Ideally, the first time that target-specificity comes into play is at the "llc" level. I think we are very close to this, but I believe there are still a few problem areas.

That said, I don't think this is a reason *not to add* the 'portable union' feature, I just think it's a reason we *haven't already* added it.

Exactly. I'm not especially interested in C-style unions, I'm interested
in discriminated unions. But the actual discriminator field is easily
represented in LLVM IR already, so there's no need to extend the IR to
support them. That's why I am only asking for C-style union support -
it's a lower-level primitive that you can use to build discriminated
unions on top of.

Yes, I completely agree with you.

-Chris

Exactly. I'm not especially interested in C-style unions, I'm interested
in discriminated unions. But the actual discriminator field is easily
represented in LLVM IR already, so there's no need to extend the IR to
support them. That's why I am only asking for C-style union support -
it's a lower-level primitive that you can use to build discriminated
unions on top of.

I don't think you would want to build discriminated unions on top of C-style
unions though.

Why?

The problem with varying-size structs is that it would be nice to be
able to pass them as function parameters and function return values
using LLVM's support of structs as first-class data types.

As I understand it, LLVM does not provide first-class structs and attempting
to return arbitrary structs from functions will randomly break. I believe you
need to implement that yourself in whatever way your target language
requires.

This is a current implementation deficiency, not a design limitation. We welcome patches to fix this.

-Chris

Uniformity when nesting, and space efficiency. Users of a language front-end
will want to nest discriminated unions, e.g. to manipulate trees. C-style
unions are designed to be unboxed, so they are useless for that unless you
introduce pointer indirections. But if you store all unions by reference,
the uniformity of the 1-word pointer size obviates the need to pad every
case to the same (largest) size, so you would drop C-style unions in favor
of a more space-efficient alternative anyway. Otherwise, you might opt for a
non-uniform representation where local unions are unboxed but unions stored
within them are boxed, which is needless complexity but might buy a little
performance.

So I think most LLVM users will represent their discriminated unions as
something like:

  {i32, {..} *}

or:

  {i32, {..}} *

but not as C-style unions.
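For instance, a binary tree in that style would be passed around as a single pointer, with each child boxed behind its own pointer (a sketch with hypothetical names):

  ; Every tree value is handled uniformly as an i32* pointing at its tag;
  ; the Node case is recovered by bitcast, so nesting costs one pointer per child.
  %leaf = type { i32 }                    ; tag 0
  %node = type { i32, i32*, i32, i32* }   ; tag 1: { tag, left, value, right }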

Okay, so you're just talking about boxed vs unboxed discriminated unions, or "by ref" vs "by value" discriminated unions. Clearly the LLVM IR support for "c style unions" would only be useful for "unboxed" or "byvalue" discriminated unions. That doesn't mean that *your* specific uses would need them. If you're doing "by-ref" or "boxed" unions, then our current support should already be sufficient.

-Chris

>>> I don't think you would want to build discriminated unions on top of
>>> C-style unions though.
>>
>> Why?
>
> Uniformity when nesting and space efficiency. Users of a language
> front-end
> will want to nest discriminated unions, e.g. to manipulate trees.

Okay, so you're just talking about boxed vs unboxed discriminated
unions, or "by ref" vs "by value" discriminated unions. Clearly the
LLVM IR support for "c style unions" would only be useful for
"unboxed" or "byvalue" discriminated unions. That doesn't mean that
*your* specific uses would need them.

Yes. I expect relatively few people would need C-style unions.

If you're doing "by-ref" or "boxed" unions, then our current support should
already be sufficient.

Yes, LLVM's existing support is fine.

Incidentally, is there any way for LLVM users to vote on a "wish list" in such
a way that developers can see which features are most sought after?

Okay, so you're just talking about boxed vs unboxed discriminated
unions, or "by ref" vs "by value" discriminated unions. Clearly the
LLVM IR support for "c style unions" would only be useful for
"unboxed" or "byvalue" discriminated unions. That doesn't mean that
*your* specific uses would need them.

Yes. I expect relatively few people would need C-style unions.

We obviously don't have them yet, so there hasn't been a driving need to make it happen.

If you're doing "by-ref" or "boxed" unions, then our current support should
already be sufficient.

Yes, LLVM's existing support is fine.

Incidentally, is there any way for LLVM users to vote on a "wish list" in such
a way that developers can see which features are most sought after?

Nope, but you can add things to the open projects list if you want.

-Chris

I wanted to mention, by the way, that my need/desire for this hasn't gone away :)

And my wish list still includes support for something like uintptr_t - a primitive integer type that is defined to always be the same size as a pointer, however large or small that may be on different platforms. (So that the frontend doesn't need to know how big a pointer is and can generate the same IR that works on both 32-bit and 64-bit platforms.)

-- Talin

Chris Lattner wrote:

I wanted to mention, by the way, that my need/desire for this hasn't
gone away :)

And my wish list still includes support for something like uintptr_t - a
primitive integer type that is defined to always be the same size as a
pointer, however large or small that may be on different platforms. (So
that the frontend doesn't need to know how big a pointer is and can
generate the same IR that works on both 32-bit and 64-bit platforms.)

Why not just use a pointer, such as i8*?

-Chris

Chris Lattner wrote:

I wanted to mention, by the way, that my need/desire for this hasn't
gone away :)

And my wish list still includes support for something like uintptr_t - a
primitive integer type that is defined to always be the same size as a
pointer, however large or small that may be on different platforms. (So
that the frontend doesn't need to know how big a pointer is and can
generate the same IR that works on both 32-bit and 64-bit platforms.)
    
Why not just use a pointer, such as i8*?
  

Suppose I have an STL-like container that has a 'begin' and 'end' pointer. Now I want to find the size() of the container. Since you cannot subtract pointers in LLVM IR, you have to cast them to an integer type first. But what integer type do you cast them to? I suppose you could simply always cast them to i64, and hope that the backend will generate efficient code for the subtraction, but I have no way of knowing this.

Now, I'm going to anticipate what I think will be your next argument, which is that at some point I must know the size of the result since I am assigning the result of size() to some integer variable eventually. Which is true; however, if the size of that eventual variable is smaller than a pointer, then I want to check it for overflow before I do the assignment. I don't want to just do a blind bitcast and have the top bits be lopped off.

The problem of checking for overflow when assigning from an integer of unknown size to an integer of known size is left as an exercise for the reader.

The TargetData class gives you the size of the pointer (getPointerSize and getPointerSizeInBits).
Can it help you?

Chris Lattner wrote:

I wanted to mention, by the way, that my need/desire for this hasn't
gone away :)

And my wish list still includes support for something like uintptr_t
- a
primitive integer type that is defined to always be the same size as a
pointer, however large or small that may be on different platforms.
(So
that the frontend doesn't need to know how big a pointer is and can
generate the same IR that works on both 32-bit and 64-bit platforms.)

Why not just use a pointer, such as i8*?

Suppose I have an STL-like container that has a 'begin' and 'end'
pointer. Now I want to find the size() of the container. Since you
cannot subtract pointers in LLVM IR, you have to cast them to an integer
type first. But what integer type do you cast them to? I suppose you
could simply always cast them to i64, and hope that the backend will
generate efficient code for the subtraction, but I have no way of
knowing this.

OT, and I don't have any deep insights, but: Nick and I recently added
IRBuilder::CreatePtrDiff() to do this subtraction for you. It returns
an i64. I believe the target-aware optimization passes keep track of
which bits may be set in any integer, which would let them generate
efficient code for the subtraction, but I don't know exactly which
optimizations would do that. You can, of course, check an i64 for
overflow when casting to i16, or whatever, and I'd hope the bit-aware
optimizations would convert the overflow check to 'false' if you
weren't actually truncating.
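For reference, the sequence CreatePtrDiff builds is roughly the following (a sketch; the divisor is the usual sizeof-style constant expression rather than a hard-coded element size):

  define i64 @size(i32* %begin, i32* %end) {
    %e     = ptrtoint i32* %end to i64
    %b     = ptrtoint i32* %begin to i64
    %bytes = sub i64 %e, %b
    ; divide by sizeof(i32), computed with the getelementptr trick
    %elems = sdiv i64 %bytes, ptrtoint (i32* getelementptr (i32* null, i32 1) to i64)
    ret i64 %elems
  }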

Chris Lattner wrote:

  

I've been thinking about how to represent unions or "disjoint types" in LLVM IR. At the moment, the only way I know to achieve this is to create a struct that is as large as the largest type in the union and then bitcast it to access the fields contained within. However, that requires that the frontend know the sizes of all of the various low-level types (the "size_t" problem, which has been discussed before), otherwise you get problems trying to mix pointer and non-pointer types.
    
That's an interesting point. As others have pointed out, we've resisted having a union type because it isn't strictly needed for the current set of front-ends. If a front-end is trying to generate target-independent IR though, I can see the utility. The "gep trick" won't work for type generation.

It seems to me that adding a union type to the IR would be a logical extension to the language. The syntax for declaring a union would be similar to that of declaring a struct. To access a union member, you would use GetElementPtr, just as if it were a struct. The only difference is that in this case the GEP doesn't actually modify the address; it merely returns the input argument as a different type. In all other ways, unions would be treated like structs, except that the size of the union would always be the size of the largest member, and all of the fields within the union would be located at relative offset zero.
    
Yes, your proposal makes sense. For syntax, I'd suggest: u{ i32, float }

Unions could of course be combined with other types:

   {{int|float}, bool} *
   n = getelementptr i32 0, i32 0, i32 1

So in the above example, the GEP returns a pointer to the float field.
    
I don't have a specific problem with adding this. The cost of doing so is that it adds (a small amount of) complexity to a lot of places that walk the type graphs. The only pass that I predict will be difficult to update to handle this is the BasicAA pass, which reasons about symbolic (not concrete) offsets and should return mustalias in the appropriate cases. Also, to validate this, I think llvm-gcc should start generating this for C unions where possible.

If you're interested in implementing this and seeing all the details of the implementation through to the end, I don't see significant problems. I think adding a simple union type would make more sense than adding first-class support for a *discriminated* union.
  

So, I spent a little bit of time looking at how to code this. Rather than defining a brand-new LLVM type, I think the best approach is to generalize the existing StructType. We can change the "isPacked" flag (which is stored in the SubclassData of the type) to a three-valued enum called StructLayout, with members UnpackedLayout, PackedLayout, and UnionLayout.

After that, it's mostly a case of hunting down everywhere that calls "isPacked", and adding the appropriate case. :)