[PATCH] - Union types, attempt 2

I'm skeptical that you *really* want to (i.e. that you wouldn't
be better off just writing helper functions in your front-end
which do the addressing and load/store and then moving on).
But, I'm not really interested in getting in the way here.

Dan

Talin schrieb:

Well, the fact that union members have to be indexed by number means that the ordering has to be part of the type - so even though type-theoretically union { i32, float } is the same as union { float, i32 }, in my implementation they are distinct types. However, from the standpoint of a frontend, this is not a great concern, because the frontend will most likely sort the list of types before constructing the IR type.

Hm… it’s placing a burden on the frontend developer.

More importantly, it’s something that the fronend developer must not forget to do, so you better make sure this is documented in capital letters in a place where the frontend developer is likely to look when preparing code generation.

Most importantly, however, this will create a lot of hassles when making code interoperable between compilers: Compiler writers need to agree on a language-independent canonical ordering.
That said, if the ordering is canonical, it could be established at the IR level. E.g. by ordering alphabetically.

When coding, please consider that many languages establish assignment compatibility between union types. E.g. a union {i32, float} value could be assigned to a name that’s typed as a union {i32, i64, float}.
This probably means the need for conversion operators, and it definitely means that indexes aren’t meaningful by themselves, only in conjunction with their union type.

I really feel that these issues should be addressed on a layer above IR. LLVM IR always requires that all types match exactly, and any conversions or promotions must be inserted explicitly by the frontend. Making unions do automatic conversions would make them dramatically different from every other IR type.

I agree. I probably shouldn't even comment, as I know so little about LLVM. But I've hand-written a couple kLOC of IR now and am starting to get a feel for the syntax, so I'll just say what "feels" right based on that and leave it to others to decide if I've absorbed enough to make any kind of sense.

Just imagining myself using such a language extension, I really would not want an ordering imposed where no natural one exists. Indices feel very wrong. Isn't a union basically just a convenient alternate interface to the various other conversion operators like bitcast, inttoptr, trunc, zext, and the rest? (In fact that's how I manipulate my expressions, the three-bit tag in the low-order bits tell me how to treat the high-order bits.) The "index" doesn't (generally) represent any kind of offset, but rather an interpretation of the bits, and none of the offset arithmetic implied by getelementptr or physical register choice implied by extractvalue will occur (except perhaps to satisfy alignment constraints, but that would be architecture dependent and I assume should therefore be invisible). Correct?

If that argument is persuasive, then the following seems a bit more consistent with the existing syntax:

     ; Manipulation of a union register variable
     %myUnion = unioncast i32, %myValue to union {i32, float}
     %fieldValue = unioncast union {i32, float} %myUnion to i32
     ; %fieldValue == %myValue

This specialized union cast fits the pattern of having specialized cast operations between value and pointer as opposed to two values or two pointers.

That's enough, as you could require that unions be loaded and stored as unions and then elements extracted. But if you want to make it a bit less syntactically noisy, and also allow the same flexibility that getelementptr would allow in accessing a single member through a pointer, you could allow

     ; Load/store of one particular union field
     store i32 %myValue, union {i32, float}* %myUnionPtr
     %fieldValue = load union {i32, float}* %myUnionPtr as i32
     ; %fieldValue == %myValue

Where I've added a preposition 'as' to the load instruction by analogy with what the cast operators do with 'to'.

I don't know that I'd argue the point much, but offhand it "feels" consistent with the rest of the syntax to have a specialized 'unioncast' operator analogous with the other specialized conversions, but overload load/store as I illustrated so that pointers to unions are conceptually just funny kinds of pointers to their fields (which they are). So in that vein, if you want a pointer to one of the alternatives in the union you'd just cast one pointer to another; to avoid alignment adjustments on what is supposed to be a no-op that cast probably shouldn't be bitcast. So what about

     %intPtr = unioncast union {i32, float}* %myUnionPtr to i32*
     %newUnionPtr = unioncast i32* %intPtr to union {i32, float}*
     ; %newUnionPtr == %myUnionPtr

I'm not necessarily advocating overloading one keyword ('unioncast') that way, though I note that it should always be unambiguous based on whether the operands are values or pointers (LLVM seems to have a strong notion of what is and is not a pointer, so this makes some kind of conceptual sense to me). Whether it's OK to create two new keywords is perhaps too fine a detail for me to have a good sense of. What would matter to me is not imposing order on unordered interpretations.

Dustin

Let me give you a use case then:

Say I have a function which returns either a floating-point number or an error code (like divide by zero or something). The way that I would represent this return result is:

{ i1, union { float, i32 } }

In other words, what we have is a small struct that contains a one-bit discriminator field, followed by a union of float and i32. The discriminator field tells us what type is stored in the union - 0 = float, 1 = i32, so this is a typical ‘tagged’ union. (We can also have untagged or “C-style” unions, as long as the programmer has some other means of knowing what type is stored in the union.)

Using a union here (as opposed to using bitcast) solves a number of problems:

  1. The size of the struct is automatically calculated by taking the largest field of the union. Without unions, your frontend would have to calculate the size of each possible field, as well as their alignment, and use that to figure the maximum structure size. If your front-end is target-agnostic, you may not even know how to calculate the correct struct size.

  2. The struct is small enough to be returned as a first-class SSA value, and with a union you can use it directly. Since bitcast only works on pointers, in order to use it you would have to alloca some temporary memory to hold the function result, store the result into it, then use a combination of GEP and bitcast to get a correctly-typed pointer to the second field, and finally load the value. With a union, you can simply extract the second field without ever having to muck about with pointers and allocas.

  3. The union provides an additional layer of type safety, since you can only extract types which are declared in the union, and not any arbitrary type that you could get with a bitcast. (Although I consider this a relatively minor point since type safety isn’t a major concern in IR.)

  4. It’s possible that some future version of the optimizer could use the additional type information provided by the union which the bitcast does not. Perhaps an optimizer which knows that all of the union members are numbers and not pointers could make some additional assumptions…

  1. Something I forgot to mention - by allowing GEP and extractvalue to work with unions, we can handle unions nested inside structs and vice versa with a single GEP instruction. This is my main argument against having special instructions for dealing with unions.

For example, in the case of { i1, union { float, i32 } }* we can use a GEP with indices [0, 1, 0] to get access to the float field in a single GEP instruction.

So just as GEP allows chaining together operations on structs, pointers and arrays, we can also chain them together with operations on unions. This can be quite powerful I think.

Dustin Laurence wrote:

     Yes, that's closer to the frontend semantics: the variants of a
     union type don't have any natural ordering, so list semantics could
     cause problems.

I agree. I probably shouldn't even comment, as I know so little about
LLVM. But I've hand-written a couple kLOC of IR now and am starting to
get a feel for the syntax, so I'll just say what "feels" right based on
that and leave it to others to decide if I've absorbed enough to make
any kind of sense.

Just imagining myself using such a language extension, I really would
not want an ordering imposed where no natural one exists. Indices feel
very wrong. Isn't a union basically just a convenient alternate
interface to the various other conversion operators like bitcast,
inttoptr, trunc, zext, and the rest?

Almost, but you're forgetting one important attribute: you can 'alloca' a union type and get something the size of the largest entry. This way, you can allocate a union of {i32, i8} and i8* without knowing in your frontend whether your system has 32 or 64-bit pointers. This is important to people who want to write fully platform neutral code in LLVM.

Nick

OK, but how does ordering an unordered "bag of alternatives" help that? I wasn't trying to imply that union wasn't useful because you could just use the other conversions, though I see I worded it so it sounds that way. I just meant that the instructions for conversions seemed like a better model for manipulating unions than structures. I suspect it is C syntax that makes us think of structs and unions together, and I was trying to defeat my own tendency to do that by using a different model.

The last time I felt like checking my little lisp code with it's numbers stuffed into pairs of pointer-sized words would build on either 32 or 64-bit x86. But I commited some sins to more or less eliminate word size dependence, at least sins against taste. Reducing the temptation to such sin seems worthwhile. :slight_smile:

Dustin

Using a union here (as opposed to using bitcast) solves a number of problems:

1) The size of the struct is automatically calculated by taking the largest field of the union. Without unions, your frontend would have to calculate the size of each possible field, as well as their alignment, and use that to figure the maximum structure size. If your front-end is target-agnostic, you may not even know how to calculate the correct struct size.

2) The struct is small enough to be returned as a first-class SSA value, and with a union you can use it directly. Since bitcast only works on pointers, in order to use it you would have to alloca some temporary memory to hold the function result, store the result into it, then use a combination of GEP and bitcast to get a correctly-typed pointer to the second field, and finally load the value. With a union, you can simply extract the second field without ever having to muck about with pointers and allocas.

3) The union provides an additional layer of type safety, since you can only extract types which are declared in the union, and not any arbitrary type that you could get with a bitcast. (Although I consider this a relatively minor point since type safety isn't a major concern in IR.)

4) It's possible that some future version of the optimizer could use the additional type information provided by the union which the bitcast does not. Perhaps an optimizer which knows that all of the union members are numbers and not pointers could make some additional assumptions...

5) Something I forgot to mention - by allowing GEP and extractvalue to work with unions, we can handle unions nested inside structs and vice versa with a single GEP instruction. This is my main argument against having special instructions for dealing with unions.

For example, in the case of { i1, union { float, i32 } }* we can use a GEP with indices [0, 1, 0] to get access to the float field in a single GEP instruction.

So just as GEP allows chaining together operations on structs, pointers and arrays, we can also chain them together with operations on unions. This can be quite powerful I think.

Yes, this is all very compelling to me. Beyond all this, we don't support bitcast of aggregate values.

-Chris