Union Type

As a side effect of bug 178 (Stacker not handling 64-bit pointers on
Solaris), I got thinking about a union type for LLVM. Is there any
good reason that LLVM shouldn't support unions? This is essentially a
structure that has its members all at the same address rather than at
sequential addresses. I know there are various issues with unions
(alignment, etc.) but wouldn't it make sense to provide a union type
that deals with all those issues in a platform independent way?

The reason this comes up is because the idiom of saving space by using a
memory location for storing different types of data is quite frequent.
For example, suppose you want to store both an "int" and a "char*" in a
single slot in an array. Each slot can have only one or the other type
of value at any given time. There are three ways to do this: structure,
casting, or union:

1: % foo = type { int, char* };
2: % foo = type { int };
3: % foo = union { int, char* };

Number 3 doesn't exist in LLVM and is what I'm proposing. In the first
case, we incur a memory object that has both an int and a char* with
non-overlapping sequential addresses. This wastes space since both the
int and the char* will never be concurrently used. In the second case
we have just an int that could be casted to a char* but that might cause
undefined results if the size of a char* is larger than the size of an
int.

The third option, union, is the compromise. It says, "make the memory
object as large as the largest element but have all elements start at
(or near) the same memory address". The "or near" part is necessary
because alignment rules might cause one of the members to start at a
non-zero offset from the start of the memory object.

In my particular case in Stacker, I have tried to do something like:

%foo = global [10 x int];

void %func() {
    %int_ptr = getelementptr [10 x int]* %foo, long 0, long 0;
    %int_val = load int* %int_ptr;
    %char_ptr = cast int %int_val to char*;
    %oops = load char* %char_ptr;
}

The above will probably work on a 32-bit platform where pointers are the
same size as int. However, on a 64-bit platform, a pointer is the same
size as a long and the value retrieved from the array would actually
span two entries in the array, one of which could have been corrupted by
a previous write of an integer into the array. Yes, I know, I should
have chosen the pointer type as the basis for the array .. but then, it
wouldn't work reliably on an 8086 with segmented memory model :slight_smile:

While various tests for word sizes and alignment rules could be used,
this problem is _gracefully_ handled by unions. To rewrite the example
above we would use something like:

%a_union = union { int, char* };

%foo = global [ 10 x %a_union ];

void %func() {
    %int_ptr = getelementptr [10 x %a_union]* %foo,
        long 0, long 0, ubyte 0;
    %int_val = load int* %int_ptr;
    %char_ptr = getelementptr [10 x %a_union]* %foo,
        long 0, long 0, ubyte 1;
    %good = load char* %char_ptr;
}

This effectively does the same thing as the first but the union takes
care of the word length and alignment issues for us.

If anyone thinks that unions are bad ideas, I challenge you to create a
computer that doesn't support an OR operation. For data structures,
unions fill the same role: structures are AND, unions are OR. Unions
only get dicey when they are incorrectly disambiguated .. but that's a
source language compiler writer's problem.

What think ye?

Reid.

As a side effect of bug 178 (Stacker not handling 64-bit pointers on
Solaris), I got thinking about a union type for LLVM. Is there any
good reason that LLVM shouldn't support unions? This is essentially a
structure that has its members all at the same address rather than at
sequential addresses. I know there are various issues with unions
(alignment, etc.) but wouldn't it make sense to provide a union type
that deals with all those issues in a platform independent way?

This is intentionally not part of the LLVM type-system, because it is
redundant. If you compile a C program that uses a union, for example, the
C front-end will turn it into a type (often a structure) that contains
only one of the element types (usually the largest one, perhaps modified
to have the correct alignment).

To access the other "parts" of the union, LLVM casts are inserted to
convert it to the appropriate type.

3: % foo = union { int, char* };

Number 3 doesn't exist in LLVM and is what I'm proposing.

A union type is not needed if you encode some simple properties of the
target (like the pointer size) into the bytecode file, which we do with
the C/C++ front-end. The only question then is how to make _portable_
bytecode files with "unions". I'm not really sure what the answer is
here.

I would really like to avoid adding a new union type, as it is not needed
at the LLVM level, and it seems like high-level languages can map
source-level unions onto existing LLVM operations. In Stacker, for
example, would this really solve the problem? You could, for example,
write a program that pushes a pointer, then pops off an int. This would
work fine on a 32-bit target, but obviously not on a 64-bit one.

Since stacker doesn't "protect" its end users (in a memory safety sense),
I think that a 64-bit stacker target should just push 64-bit pointers like
it would push 64 bit integer types: just take up two slots. Is there
anything wrong with this approach? Using a union for the stacker stack on
a 64-bit machine would waste a ton of space when integers are pushed.

While various tests for word sizes and alignment rules could be used,
this problem is _gracefully_ handled by unions. To rewrite the example

The problem with adding unions is that it would require modifying _all of
the LLVM code_ that looks at the type-system, and it doesn't seem like it
gives us anything fundamentally new (like a vector type would). Also,
forcing the front-end to generate casts is an important feature of the
LLVM type-system: it makes it obvious that something non-type-safe is
happening.

If anyone thinks that unions are bad ideas, I challenge you to create a
computer that doesn't support an OR operation. For data structures,
unions fill the same role: structures are AND, unions are OR. Unions
only get dicey when they are incorrectly disambiguated .. but that's a
source language compiler writer's problem.

The problem isn't that we can't effectively represent this, the problem is
that it's not clear what the best way to do it is. :slight_smile:

-Chris

This is intentionally not part of the LLVM type-system, because it is
redundant. If you compile a C program that uses a union, for example, the
C front-end will turn it into a type (often a structure) that contains
only one of the element types (usually the largest one, perhaps modified
to have the correct alignment).

To access the other "parts" of the union, LLVM casts are inserted to
convert it to the appropriate type.

Okay, that's fair. IMO a union type would make compiler writing easier,
but I understand the minimalist approach that LLVM needs to maintain.

A union type is not needed if you encode some simple properties of the
target (like the pointer size) into the bytecode file, which we do with
the C/C++ front-end. The only question then is how to make _portable_
bytecode files with "unions". I'm not really sure what the answer is
here.

Me either :frowning:

I would really like to avoid adding a new union type, as it is not needed
at the LLVM level, and it seems like high-level languages can map
source-level unions onto existing LLVM operations. In Stacker, for
example, would this really solve the problem? You could, for example,
write a program that pushes a pointer, then pops off an int. This would
work fine on a 32-bit target, but obviously not on a 64-bit one.

Stack mistakes resulting from the Stacker source, as in this case, are
the problem of the Stacker programmer, not the compiler. Its possible
but not valid to do the operation you suggested. If you push a pointer,
you should pop it with something that expects a pointer and knows how to
use it correctly. The issue in my mind is by how much one increments the
stack index when a pointer is pushed. The answer in the current
implementation is always "1". That works fine on a 32-bit platform
because the stack array element is 32-bits (int). Both ints and pointers
fit in 32-bits and incrementing the index by 1 moves the index by
32-bits. When you move to a machine that has 64-bit pointers and 32-bit
ints, then this needs to change so that the index is incremented by 2
when a pointer is pushed. There's a number of ways to solve this, the
union type is one of them but I understand your reasons for not wanting
it in the LLVM Assembly language.

Since stacker doesn't "protect" its end users (in a memory safety sense),
I think that a 64-bit stacker target should just push 64-bit pointers like
it would push 64 bit integer types: just take up two slots. Is there
anything wrong with this approach?

Nope. In fact, what I think I'll implement is just a 64-bit stack. That
is, the base type of the stack will be "long" instead of int. LLVM
assures me that this is 64-bits. It is large enough to hold a pointer on
all supported platforms. By doing this, I don't have to mess with the
index increment, its still always 1. This also has the added advantage
of increasing the range of integer values and better supporting floating
point should that become a future feature.

Using a union for the stacker stack on
a 64-bit machine would waste a ton of space when integers are pushed.

Why? The union would still be 64-bits long. If we increase the integer
value size to 64-bits it won't be a waste, it'll be a "feature" :slight_smile:

The problem with adding unions is that it would require modifying _all of
the LLVM code_ that looks at the type-system, and it doesn't seem like it
gives us anything fundamentally new (like a vector type would). Also,
forcing the front-end to generate casts is an important feature of the
LLVM type-system: it makes it obvious that something non-type-safe is
happening.

Okay! I'm convinced!

Reid.

> example, would this really solve the problem? You could, for example,
> write a program that pushes a pointer, then pops off an int. This would
> work fine on a 32-bit target, but obviously not on a 64-bit one.

Stack mistakes resulting from the Stacker source, as in this case, are
the problem of the Stacker programmer, not the compiler. Its possible
but not valid to do the operation you suggested. If you push a pointer,
you should pop it with something that expects a pointer and knows how to
use it correctly.

Ok, so in this case, pushing a 2-word pointer on a 64-bit target would be
acceptable. If we can just figure out how to...

The issue in my mind is by how much one increments the stack index when
a pointer is pushed. The answer in the current implementation is always
"1". That works fine on a 32-bit platform because the stack array
element is 32-bits (int). Both ints and pointers fit in 32-bits and
incrementing the index by 1 moves the index by 32-bits. When you move to
a machine that has 64-bit pointers and 32-bit ints, then this needs to
change so that the index is incremented by 2 when a pointer is pushed.

Ok, good point. Well there is a fairly ugly way in LLVM of computing the
size of a type: through the getelementptr instruction. It works basically
like the traditional implementation of 'offsetof' macro in C. If you want
to get the size of 'sbyte*' for example:

%tmp = getelementptr sbyte** null, long 1 ; tmp = (sbyte**)null+1
%size = cast sbyte** %tmp to int

Of course, in LLVM, these can both be folded together into two constant
expressions, so they will not need to be evaluated at run-time.

> Since stacker doesn't "protect" its end users (in a memory safety sense),
> I think that a 64-bit stacker target should just push 64-bit pointers like
> it would push 64 bit integer types: just take up two slots. Is there
> anything wrong with this approach?

Nope. In fact, what I think I'll implement is just a 64-bit stack. That
is, the base type of the stack will be "long" instead of int. LLVM
assures me that this is 64-bits. It is large enough to hold a pointer on
all supported platforms. By doing this, I don't have to mess with the
index increment, its still always 1. This also has the added advantage
of increasing the range of integer values and better supporting floating
point should that become a future feature.

That would work, but the problem is that it wastes quite a bit of space.
The advantage of that is that it would be much less sensitive to user
errors like pushing a pointer, then popping off an integer though...

> Using a union for the stacker stack on
> a 64-bit machine would waste a ton of space when integers are pushed.

Why? The union would still be 64-bits long. If we increase the integer
value size to 64-bits it won't be a waste, it'll be a "feature" :slight_smile:

Heh, I was assuming that most integers would be 32-bits in size. :slight_smile:

-Chris