Platform-independent union in LLVM IR? Or at least a constant sizeof?

Hello.

I have to emit LLVM IR from Standard ML, which has no LLVM bindings, so I am generating the textual IR representation. I now have to translate a union type. Union types seem to have some history in LLVM, but currently there is no union support. The proposed workaround is to define a type of the correct size and use casts. That would be fine if only I could define a type of the correct size in a platform-independent way. It seems that the only thing I can write for n in, e.g., [ n x i8 ] is a literal integer. Constant expressions (using the gep trick to get the size) do not work there.
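To illustrate the workaround in C terms, here is a minimal sketch (not LLVM API). The variant types and every name in it are hypothetical. The storage is a plain byte array whose length is the maximum of the variants' sizes, and fields are accessed by copying bytes. A C compiler folds UNION_SIZE to a target-specific literal, which is exactly the number that would have to appear as n in [ n x i8 ].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical variants of the union being translated. */
typedef struct { int32_t tag; double value; } variant_a;
typedef struct { void *ptr; } variant_b;

/* Compile-time maximum of two sizes. */
#define MAX_SIZE(a, b) ((a) > (b) ? (a) : (b))
#define UNION_SIZE MAX_SIZE(sizeof(variant_a), sizeof(variant_b))

/* The "type of the correct size": raw bytes only.
   The value of UNION_SIZE depends on the target. */
typedef struct { unsigned char bytes[UNION_SIZE]; } union_storage;

static_assert(sizeof(union_storage) >= sizeof(variant_a), "variant_a must fit");
static_assert(sizeof(union_storage) >= sizeof(variant_b), "variant_b must fit");

/* Write a variant into the storage by copying its bytes. */
void store_a(union_storage *u, variant_a v) {
    memcpy(u->bytes, &v, sizeof v);
}

/* Read a variant back out of the storage by copying its bytes. */
variant_a load_a(const union_storage *u) {
    variant_a v;
    memcpy(&v, u->bytes, sizeof v);
    return v;
}
```

Copying with memcpy sidesteps alignment questions here, since the byte array itself only has an alignment of 1. A real translation would also have to pick a suitable alignment for the storage.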

This results in two questions:

  1. Am I overlooking something, and is there a way to generate platform-independent IR for union types?

  2. If not, can I somehow avoid re-implementing the sizeof computation in Standard ML? Doing so introduces redundancy and the potential for subtle errors.


Peter

LLVM IR is inherently platform-dependent. Type sizes depend, among other things, on the DataLayout.

This is probably not what you want to hear, but frankly the only clean solution to what you want seems to be adding LLVM bindings. Perhaps it’s not too bad if you start with the C bindings?

Thanks for your answer.

There seems to be quite a large subset of LLVM IR that can be written platform-independently, and other things are almost platform-independent, without any reason obvious to me why the small remaining gap is not bridged. The union type seems to be one of them. I don’t know exactly when LLVM compilation transitions from platform-independent to platform-dependent, but exposing some platform-specific information to the LLVM IR should not be difficult (given that this information is already known to the compiler).

For example, a %t = type [max(sizeof(%variant1), …, sizeof(%variantN)) x i8] seems to be all that is required to implement unions in a platform-independent way. And, given that I can easily declare this using the C bindings, I don’t see any good reason that prevents exposing this in the IR itself.

Btw, is there a list of ‘platform-specific’ aspects of LLVM IR? Only aspects within the scope of LLVM IR itself would be needed, e.g., sizeof would be on the list, but differences between standard libraries wouldn’t.

From a design perspective, writing my own bindings for a programming language (like SML) is a very maintenance-intensive task, which I do not have the resources for. E.g., there is a port to SML, but it has been unmaintained for 8 years now (melsman/sml-llvm on GitHub: Standard ML Bindings for LLVM). On the other hand, the textual LLVM IR format seems to be quite a stable target, in particular when considering the fragment needed for compiler front ends. This makes LLVM IR appealing also for smaller (research) projects that cannot always use mainstream implementation languages and have very limited resources for maintenance tasks.

Before I encountered the problem with unions, I had only one platform-specific aspect to consider: the integer type size_t, which can hold the difference of two pointers.


Peter

> There seems to be quite a large subset of LLVM IR that can be written platform-independently, and other things are almost platform-independent, without any reason obvious to me why the small remaining gap is not bridged. The union type seems to be one of them. I don’t know exactly when LLVM compilation transitions from platform-independent to platform-dependent, but exposing some platform-specific information to the LLVM IR should not be difficult (given that this information is already known to the compiler).

LLVM is designed to be a low-level representation (it used to be part of its name, after all). It is closer to assembly than it is to C, and over time it has only grown more so: for example, LLVM has recently switched to pointer types that no longer specify what type they point to. Where LLVM shares features with C, they are not guaranteed to behave the same way. For example, LLVM struct types are not necessarily laid out the way the C ABI would lay out an equivalent C struct. (Although it has been argued from time to time that LLVM should provide a C ABI support library for mapping C types to LLVM, this does not yet exist.)

> For example, a %t = type [max(sizeof(%variant1), …, sizeof(%variantN)) x i8] seems to be all that is required to implement unions in a platform-independent way. And, given that I can easily declare this using the C bindings, I don’t see any good reason that prevents exposing this in the IR itself.

What could you do with such a type? The only reasonable thing I can see is to point to it and, as mentioned above, pointer types are becoming opaque in LLVM anyway, so such a type would have no value.

> Before I encountered the problem with unions, I had only one platform-specific aspect to consider: the integer type size_t, which can hold the difference of two pointers.

If you want to be pedantic, the difference of two pointers is a ptrdiff_t, which isn’t necessarily the same as size_t or ssize_t. A non-exhaustive list of issues in representing C in LLVM includes potential mismatches in type ABIs (I believe some triples have different alignments for i128 in LLVM versus C), calling conventions (especially where structs are involved), and variable arguments. That is just one minute of thinking off the top of my head, so I’m sure I’m missing several.
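The size_t versus ptrdiff_t distinction can be seen in a few lines of C. This is only an illustrative sketch with hypothetical function names: subtracting two pointers yields the signed type ptrdiff_t, while sizeof yields the unsigned type size_t.

```c
#include <assert.h>
#include <stddef.h>

/* ptrdiff_t is signed, size_t is unsigned. */
static_assert((ptrdiff_t)-1 < 0, "ptrdiff_t is a signed type");
static_assert((size_t)-1 > 0, "size_t is an unsigned type");

/* Distance between two elements of the same array, counted in elements.
   The pointer subtraction itself has type ptrdiff_t and may be negative. */
ptrdiff_t element_distance(const int *from, const int *to) {
    return to - from;
}

/* Object sizes, by contrast, are expressed in size_t. */
size_t array_bytes(size_t count) {
    return count * sizeof(int);
}
```

A negative distance, such as the one from element 6 back to element 2, is representable in ptrdiff_t but not in size_t.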

For one thing, this type could be part of another aggregate type.
Also, I could have registers of this type, or function arguments and return values. But your concern may be right: casting a union field to/from a register of type [n x i8] would require some dirty-looking IR using alloca, store, pointer cast, and then load. No idea how LLVM’s optimizations would handle that.
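In C terms, that round trip through a stack slot looks roughly like the sketch below. The type raw8 and both function names are hypothetical, and the sketch assumes an 8-byte double. The by-value raw8 plays the role of a register of type [8 x i8], and each memcpy through a local variable corresponds to the alloca, store and load sequence.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a value of type [8 x i8]. */
typedef struct { unsigned char bytes[8]; } raw8;

static_assert(sizeof(double) == 8, "this sketch assumes an 8-byte double");

/* Pack: view the field's bytes as a raw byte array value. */
raw8 raw8_from_double(double d) {
    raw8 r;                          /* local slot, like an alloca */
    memcpy(r.bytes, &d, sizeof d);   /* store the field's bytes */
    return r;                        /* load the slot as raw bytes */
}

/* Unpack: reinterpret the raw bytes as the field type again. */
double raw8_to_double(raw8 r) {
    double d;
    memcpy(&d, r.bytes, sizeof d);
    return d;
}
```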

– Peter