variable sized structs in LLVM

Hi LLVM-dev!

I'm having problems figuring out how to do variable sized structs in LLVM (which are neccessary for PyPy's LLVM backend, on which I'm working). I'm trying to do the equivalent of

struct array {
         long refcount;
         long length;
         long items[1];
};

in LLVM, where the items array can be arbitrarily long. I guess that the struct definition should be something like:

%array = type {int, int, [1 x int]}

but how would I allocate such a thing, with the items array being, say, 9 items long? In C I would do something like:

malloc(sizeof(struct array) + ((9 - 1) * sizeof(long)));

but there is no sizeof in LLVM, right? If I try compile C code like that with the LLVM C frontend, I get

%struct.array = type { int, int, [1 x int] }

...

%tmp.0 = malloc [44 x sbyte]
%tmp.5 = cast [44 x sbyte]* %tmp.0 to %struct.array*

It is clear what happens here, but I don't know how I would reproduce that, because I can't easily find out the length in bytes of the array.

I can think of workarounds (like using the struct module of the Python standard library to calculate the sizes) but I have the feeling that I'm missing something and there is a simple way to do that in LLVM.

Thanks for any help,

Carl Friedrich

Carl,

The thing you're missing is that LLVM's primitive types have well known,
fixed sizes that are not target dependent. A ulong is 8 bytes. A uint is
4 bytes. A ushort is 2 bytes. Etc. and always.

There are also methods in LLVM to help you deal with the size of a type
in bits and bytes. In particular you might want to note the following
methods:

Type::isSized
Type::getPrimitiveSize
Type::getPrimitiveSizeInBites

Reid

Hi Reid,

Yes, for that you need TargetData:

http://illuvium.net/docs/doxygen/classllvm_1_1TargetData.html

Reid.

Hi Reid,

Reid Spencer wrote:

Yes, for that you need TargetData:

http://illuvium.net/docs/doxygen/classllvm_1_1TargetData.html

I feared that that would be the answer :-). We're not using the LLVM-API at the moment. We're just generating a .ll file 'by hand', e.g. via simple string operations. I guess we get what we deserve :-).

Thanks for your help,

Carl Friedrich

Its certainly possible to generate .ll files but its probably about the
same amount of work to use the LLVM API and there are significant speed
and validity benefits to doing so.

Here's some ideas on how to handle this situation for the variable sized
structs you need for PyPy:

1. Don't support the feature in your first release (not sure if this is
   viable or not).

2. Make some worst case assumptions about size and alignment (8 bytes)
   and allocate that size, possibly over-allocating memory.

3. Figure out the exact maximum size of the struct so you can declare
   the type correctly.

4. Implement variable sized fields as separately allocated arrays so
   that your example becomes:
    %struct.array = type { int, int, int* }
   The third field is allocated according to the actual size needed.
   This is a bit more code generation but should be fairly efficient.

Reid.

Hi Reid!

Reid Spencer wrote:

Its certainly possible to generate .ll files but its probably about the
same amount of work to use the LLVM API and there are significant speed
and validity benefits to doing so.

I can't really judge that because I've not used the API yet. For now creating .ll files seem like a good solution since string handling is so easy in Python. I guess we will have to use the API at a later point when we start to work on implementing jit compilation/dynamic specialization and want to directly use emitted functions.

Here's some ideas on how to handle this situation for the variable sized
structs you need for PyPy:

1. Don't support the feature in your first release (not sure if this is
   viable or not).

A lot of things are implemented as var-sized struct, e.g. strings are a struct with a hash and the chars, so we can't just leave 'em out.

2. Make some worst case assumptions about size and alignment (8 bytes)
   and allocate that size, possibly over-allocating memory.

yes, this might a quick solution for the next weeks since we don't deallocate _any_ memory at the moment because the garbage collector is not implemented yet :-). The GC is what I'll be working on during the summer.

3. Figure out the exact maximum size of the struct so you can declare
   the type correctly.
4. Implement variable sized fields as separately allocated arrays so
   that your example becomes:
    %struct.array = type { int, int, int* }
   The third field is allocated according to the actual size needed.
   This is a bit more code generation but should be fairly efficient.

I guess we will probably do that.

Thanks for your suggestions,

Carl Friedrich

Reid Spencer wrote:

Its certainly possible to generate .ll files but its probably about the
same amount of work to use the LLVM API and there are significant speed
and validity benefits to doing so.

Does this mean that LLVM is moving away from the idea of a truly abstract IR
language, to being a set of development libraries for use by
build-time-dependent frontends?

LLVM is still an abstract IR. LLVM does not specify struct sizes, the
alignment is a property of the target. In LLVM, you can do
target-independent array and struct element accesses, which get code
generated to correct numeric offsets by the backends which know their
target.

If you are writing an LLVM pass and you absolutely NEED to know the
numeric size of a struct, you must use the TargetData object which is
target-specific. If you want to get the numeric size of a struct
without using the LLVM API, it will require you to eesentially
reimplement TargetData yourself in your programming language of choice.

However, it does not change the fact that LLVM is and continues to be an
abstraction of target machines.

What do you mean by "build-time-dependent frontends"? Frontends can be
completely separate from the LLVM infrastructure and do not even need to
be linked with any LLVM classes to create LLVM code from source
languages.

Misha Brukman wrote:

Reid Spencer wrote:
> Its certainly possible to generate .ll files but its probably about
> the same amount of work to use the LLVM API and there are
> significant speed and validity benefits to doing so.

Does this mean that LLVM is moving away from the idea of a truly
abstract IR language, to being a set of development libraries for use
by build-time-dependent frontends?

[snip]

What do you mean by "build-time-dependent frontends"?

Sorry, I meant frontends that link to LLVM's libraries, as opposed to
spitting out .ll files for consumption by LLVM's assembler.

Frontends can be completely separate from the LLVM infrastructure and do
not even need to be linked with any LLVM classes to create LLVM code from
source languages.

Unless, apparently, you're trying to implement variable-sized objects in
your language. :slight_smile: One of LLVM's powerful (IMHO) selling points was the
concept of a *complete*, abstract IR language which any independent
frontend could write to (sending llvm-as unoptimized, even ugly, but
correct, IR assembly that gets converted to optimized native code), without
being dependent on LLVM's own code, i.e. just write to the IR spec, and be
able to do anything that any frontend linked to LLVM itself, like Reid's
Stacker, could do.

I guess what I'm suggesting is that both the variable struct & data
alignment problems should be at some point abstracted away within LLVM's IR
language, with data alignment issues ideally completely invisible to the
user of LLVM's IR, and with variable structs handled perhaps like GCC & C99
eventually did to support variable sized structs in C. See 'Zero-length
arrays':

One of LLVM's powerful (IMHO) selling points was the
concept of a *complete*, abstract IR language which any independent
frontend could write to (sending llvm-as unoptimized, even ugly, but
correct, IR assembly that gets converted to optimized native code), without
being dependent on LLVM's own code, i.e. just write to the IR spec, and be
able to do anything that any frontend linked to LLVM itself, like Reid's
Stacker, could do.

You can still do that. You just can't make assumptions about sizes and
alignments in a platform-neutral way. FWIW, you couldn't do it in Java
either for much the same reason: the details differ from platform to
platform.

If your language depends on knowing the sizes/alignment/offset of actual
machines then the language is not, itself, machine neutral. However,
there are various techniques youc an use to do what you want. I've
already suggested how you can make variable sized structures by using a
pointer for one of the structure members.

As for Stacker, it doesn't generate LLVM assembly. At least not
directly. Its compiler directly manipulates the C++ IR from which either
bytecode or assembly can be generated.

I guess what I'm suggesting is that both the variable struct & data
alignment problems should be at some point abstracted away within LLVM's IR
language,

They are!

with data alignment issues ideally completely invisible to the
user of LLVM's IR,

They are!

and with variable structs handled perhaps like GCC & C99
eventually did to support variable sized structs in C.

LLVM is not GCC nor C99, its LLVM. The notion doesn't exist in LLVM but
there are alternatives, several already suggested, that you can use
instead. It is likely that "under the covers" GCC and C99 are using
similar techniques.

If you believe that variable sized structs are a feature that is missing
in LLVM, then that's another discussion. However, LLVM needs to remain
"low level" and it is unlikely such a feature would gain much traction
since a combination of existing features can accomplish the same thing.
However, I'll ask other LLVMers weigh in on that discussion.

Reid.

Hi Reid,

If you believe that variable sized structs are a feature that is missing
in LLVM, then that's another discussion. However, LLVM needs to remain
"low level" and it is unlikely such a feature would gain much traction
since a combination of existing features can accomplish the same thing.

The only alternative so far in our case seems to be this one (quoting
from a previous e-mail):

4. Implement variable sized fields as separately allocated arrays so
that your example becomes:
%struct.array = type { int, int, int* }
The third field is allocated according to the actual size needed.
This is a bit more code generation but should be fairly efficient.

If the LLVM optimizers are clever enough to figure out that the arrays
doesn't really have to be allocated separately, but can be inlined
within the allocation of the parent structure, and moreover no pointer
indirection is needed, then it is indeed all that we need.

In this sense, the feature Carl asked for is quite low level; we are
looking for an effect that can be acheived either by optimizers working
on a higher-level description in which the optimization hint cannot be
expressed, or by giving the hint explicitely -- which so far doesn't
seem possible portably in an assembler source.

Armin

Hi Reid,

> If you believe that variable sized structs are a feature that is missing
> in LLVM, then that's another discussion. However, LLVM needs to remain
> "low level" and it is unlikely such a feature would gain much traction
> since a combination of existing features can accomplish the same thing.

The only alternative so far in our case seems to be this one (quoting
from a previous e-mail):

> 4. Implement variable sized fields as separately allocated arrays so
> that your example becomes:
> %struct.array = type { int, int, int* }
> The third field is allocated according to the actual size needed.
> This is a bit more code generation but should be fairly efficient.

If the LLVM optimizers are clever enough to figure out that the arrays
doesn't really have to be allocated separately, but can be inlined
within the allocation of the parent structure, and moreover no pointer
indirection is needed, then it is indeed all that we need.

I don't know if LLVM currently will do this or not, but it is unlikely.
I'll defer to someone who knows the optimizations better. However, what
I do know is that such an optimization could quite easily be written.
The situation is quite identifiable in just a few lines of LLVM C++ IR
code and the transformation is similarly simple. So, one approach would
be to use the IR as is (as you quoted above) and write an optimization
to do what you want (turn two allocations into one).

In this sense, the feature Carl asked for is quite low level; we are
looking for an effect that can be acheived either by optimizers working
on a higher-level description in which the optimization hint cannot be
expressed, or by giving the hint explicitely -- which so far doesn't
seem possible portably in an assembler source.

Agreed. Chances are the optimization I suggested above could be written
fairly easily. The change to the assembly language to provide the "hint"
or alternatively directly support variable sized structures is a much
larger task.

I think we should wait until Chris and others can weigh in on this,
however. Chris will be off-line until 6/26 or thereabouts. If you can
defer your implementation for a few more days, perhaps we can work out a
solution that works for everyone.

Reid.

I'm having problems figuring out how to do variable sized structs in LLVM (which are neccessary for PyPy's LLVM backend, on which I'm working). I'm trying to do the equivalent of

...

in LLVM, where the items array can be arbitrarily long. I guess that the struct definition should be something like:

%array = type {int, int, [1 x int]}

but how would I allocate such a thing, with the items array being, say, 9 items long? In C I would do something like:

malloc(sizeof(struct array) + ((9 - 1) * sizeof(long)));

but there is no sizeof in LLVM, right?

sizeof in llvm is really 'offsetof from the null pointer'. See below for an example. In short, yes we do have it, though it's not super-intuitive.

If I try compile C code like that with the LLVM C frontend, I get

%struct.array = type { int, int, [1 x int] }

...

%tmp.0 = malloc [44 x sbyte]
%tmp.5 = cast [44 x sbyte]* %tmp.0 to %struct.array*

This is almost exactly what you want to do. Please make the array be [0 x int] though, as it is undefined in llvm to access past the end of an array with non-zero length. I added a note about zero-length arrays here:

http://llvm.cs.uiuc.edu/docs/LangRef.html#i_getelementptr
http://llvm.cs.uiuc.edu/docs/LangRef.html#t_array

It is clear what happens here, but I don't know how I would reproduce that, because I can't easily find out the length in bytes of the array.

I'm not sure I understand. Are you saying that, at runtime, you need ot know the actual length of the array? If so, you should store this information as one of the elements of the struct before the array. If you have a pascal style array for example, you could do something like this:
    { uint, [0 x float] }

... storing the actual dynamic size of the array in the uint member.

As you mention above, you should use malloc and cast the return value. If you want to get the desired size in a target-independent way (always good), you should do something like this:

%array = type { int, int, [0 x int] }
implementation
%array *%NewArray(int %Length) {
   ;; Get the offset of the 'Length' element of the array from the null
   ;; pointer.
   %Size = getelementptr %array* null, int 0, uint 2, int %Length
   %SizeU = cast int* %Size to uint
   %Ptr = malloc sbyte, uint %SizeU
   %Result = cast sbyte* %Ptr to %array*
   ret %array* %Result
}

I can think of workarounds (like using the struct module of the Python standard library to calculate the sizes) but I have the feeling that I'm missing something and there is a simple way to do that in LLVM.

This shouldn't be necessary.

-Chris

Reid Spencer wrote:

Yes, for that you need TargetData:

http://illuvium.net/docs/doxygen/classllvm_1_1TargetData.html

I feared that that would be the answer :-). We're not using the LLVM-API at the moment. We're just generating a .ll file 'by hand', e.g. via simple string operations.

This is just fine. The only advantage to using the llvm C++ API's are compile time speed and integration with the JIT. If neither of those is a priority yet, emitting text is a great way to go!

I guess we get what we deserve :-).

Not at all, this is a great way to go.

-Chris

No, not at all.

-Chris

Frontends can be completely separate from the LLVM infrastructure and do
not even need to be linked with any LLVM classes to create LLVM code from
source languages.

Unless, apparently, you're trying to implement variable-sized objects in
your language. :slight_smile:

Heh, hopefully my previous emails help clear this up. Variable sized objects should not be a problem for llvm.

One of LLVM's powerful (IMHO) selling points was the concept of a *complete*, abstract IR language which any independent frontend could write to (sending llvm-as unoptimized, even ugly, but correct, IR assembly that gets converted to optimized native code), without being dependent on LLVM's own code, i.e. just write to the IR spec, and be able to do anything that any frontend linked to LLVM itself, like Reid's Stacker, could do.

Yup!

I guess what I'm suggesting is that both the variable struct & data
alignment problems should be at some point abstracted away within LLVM's IR
language, with data alignment issues ideally completely invisible to the
user of LLVM's IR, and with variable structs handled perhaps like GCC & C99
eventually did to support variable sized structs in C. See 'Zero-length
arrays':

Zero Length (Using the GNU Compiler Collection (GCC))

Yup, I agree. If my previous email didn't clear this up, please lemme know :slight_smile:

-Chris

One of LLVM's powerful (IMHO) selling points was the
concept of a *complete*, abstract IR language which any independent
frontend could write to (sending llvm-as unoptimized, even ugly, but
correct, IR assembly that gets converted to optimized native code), without
being dependent on LLVM's own code, i.e. just write to the IR spec, and be
able to do anything that any frontend linked to LLVM itself, like Reid's
Stacker, could do.

You can still do that. You just can't make assumptions about sizes and
alignments in a platform-neutral way. FWIW, you couldn't do it in Java
either for much the same reason: the details differ from platform to
platform.

Yup.

If your language depends on knowing the sizes/alignment/offset of actual
machines then the language is not, itself, machine neutral.

This is true.

However, there are various techniques youc an use to do what you want. I've already suggested how you can make variable sized structures by using a pointer for one of the structure members.

This is one way to do it, but it is not the only way. This is also less efficient for obvious reasons. It's better to just use variable sized arrays.

As for Stacker, it doesn't generate LLVM assembly. At least not
directly. Its compiler directly manipulates the C++ IR from which either
bytecode or assembly can be generated.

I really don't see what this has to do with anything. You're confusing what the LLVM API's provide and what the LLVM language/IR provides. Nothing in stacker depends on the LLVM API's: it could be written as a perl script that outputs .ll files.

and with variable structs handled perhaps like GCC & C99
eventually did to support variable sized structs in C.

LLVM is not GCC nor C99, its LLVM.

This is true, but obviously LLVM wants to support C99 and other things that GCC does! :slight_smile:

The notion doesn't exist in LLVM

This is not true.

but there are alternatives, several already suggested, that you can use instead. It is likely that "under the covers" GCC and C99 are using similar techniques.

Not true, but for a different reason. C/C++ are already not target-independent (e.g. you can write code that detects endianness, header files are system specific, etc). Because of this, the llvm-gcc front-end doesn't really even try to emit super portable llvm code. As such, when using C VLA's, they just get lowered to allocations of exactly the right number of bytes, without the LLVM portable 'sizeof' construct being used.

If you believe that variable sized structs are a feature that is missing
in LLVM, then that's another discussion. However, LLVM needs to remain
"low level" and it is unlikely such a feature would gain much traction
since a combination of existing features can accomplish the same thing.

Fortunately, nothing in LLVM has to be there explicitly to support the feature. It's one of those high-level things that can be expressed with the low-level things we already have. :slight_smile:

-Chris

Hi Reid,

If you believe that variable sized structs are a feature that is missing
in LLVM, then that's another discussion. However, LLVM needs to remain
"low level" and it is unlikely such a feature would gain much traction
since a combination of existing features can accomplish the same thing.

The only alternative so far in our case seems to be this one (quoting
from a previous e-mail):

4. Implement variable sized fields as separately allocated arrays so
that your example becomes:
%struct.array = type { int, int, int* }
The third field is allocated according to the actual size needed.
This is a bit more code generation but should be fairly efficient.

ick.

If the LLVM optimizers are clever enough to figure out that the arrays
doesn't really have to be allocated separately, but can be inlined
within the allocation of the parent structure, and moreover no pointer
indirection is needed, then it is indeed all that we need.

In simple cases, yes we could do something like this. In general though, this is a very hard problem and not something your front-end wants to depend on for something as core as arrays and strings.

In this sense, the feature Carl asked for is quite low level; we are
looking for an effect that can be acheived either by optimizers working
on a higher-level description in which the optimization hint cannot be
expressed, or by giving the hint explicitely -- which so far doesn't
seem possible portably in an assembler source.

Agreed. If my previous emails don't clarify the issue, please let me know.

Thanks!

-Chris

This is almost exactly what you want to do. Please make the array be [0 x int] though, as it is undefined in llvm to access past the end of an array with non-zero length. I added a note about zero-length arrays here:

http://llvm.cs.uiuc.edu/docs/LangRef.html#i_getelementptr
http://llvm.cs.uiuc.edu/docs/LangRef.html#t_array

As you mention above, you should use malloc and cast the return value. If you want to get the desired size in a target-independent way (always good), you should do something like this:

%array = type { int, int, [0 x int] }
implementation
%array *%NewArray(int %Length) {
  ;; Get the offset of the 'Length' element of the array from the null
  ;; pointer.
  %Size = getelementptr %array* null, int 0, uint 2, int %Length
  %SizeU = cast int* %Size to uint
  %Ptr = malloc sbyte, uint %SizeU
  %Result = cast sbyte* %Ptr to %array*
  ret %array* %Result
}

Chris you should maybe add this example to the getelementptr documentation.

Aaron